We will fit a logistic regression model to a sample of data from the mortgage dataset we've used throughout the semester.
From millions of mortgages, you have a sample of 5000. Let's pretend these are the mortgage applications from the lender for which you work.
The primary questions we'll answer with our model are these:
To make things fun, we will run today like a Kaggle competition.
We will judge models using the $F_1$ score introduced in this section of the course notes.
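In case a quick reminder of that metric helps, here is a minimal sketch of computing it. It assumes scikit-learn (you can use whatever tools the course notes introduced), and the tiny example arrays are made up for illustration.

```python
# Minimal sketch: the F1 score is the harmonic mean of precision and recall.
# Assumes scikit-learn; the tiny example arrays are made up.
from sklearn.metrics import f1_score

actual    = [0, 0, 1, 1, 1]    # true 0/1 outcomes
predicted = [0, 1, 1, 1, 0]    # a model's 0/1 predictions

print( f1_score( actual, predicted ) )    # 0.666..., since precision = recall = 2/3
```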
Question 1. Why do we split a dataset into train/test data parts?
Question 2. Isn't it better to fit a model on all the data you have?
Each student should work on their own model, so that everyone masters the concepts themselves rather than just watching someone else work. But of course everyone in a group is encouraged to discuss and help one another as much as possible!
Take these steps in breakout groups now:
Check: You should have found the following values, based on the training data I gave you.
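If you want a way to double-check your work, here is one possible way to fit such a model and measure it on the training data. It is only a sketch: it assumes scikit-learn, a cleaned training file named cleaned-mortgage-training-data.csv, and a hypothetical 0/1 response column named outcome, so substitute your own tools, file, and column names.

```python
# A sketch only: assumes scikit-learn and hypothetical file and column names
# (cleaned-mortgage-training-data.csv with a 0/1 response column named outcome).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

df = pd.read_csv( 'cleaned-mortgage-training-data.csv' )
X = df.drop( columns=['outcome'] )       # every other column as a predictor
y = df['outcome']

model = LogisticRegression( max_iter=1000 ).fit( X, y )
print( f1_score( y, model.predict( X ) ) )   # F1 measured on the training data itself
```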
Question 3. None of these measurements say how well the model performs on unseen data. We will evaluate that in the test phase, at the end of our competition. But what if you wanted, right now, to test whether the model works well on unseen data? What might you do?
Take these steps in breakout groups now:
Split your data into a training part and a validation part.
Then define two functions so that model = fit_model_to( df_training ) fits a logistic regression model to the training data, and F1 = score_model( model, df_validation ) computes that model's F1 score on the validation data.
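Here is one way those two functions and the split might look. This is only a sketch: it assumes scikit-learn and the same hypothetical file and column names as above, and it uses pandas' sample() (which draws on NumPy's random number generator) to make the split.

```python
# A sketch only: assumes scikit-learn, a cleaned training file named
# cleaned-mortgage-training-data.csv, and a hypothetical 0/1 response column
# named outcome.  Substitute your own tools, file, and column names.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_model_to ( df ):
    # Fit a logistic regression using every column except the response as a predictor.
    X = df.drop( columns=['outcome'] )
    return LogisticRegression( max_iter=1000 ).fit( X, df['outcome'] )

def score_model ( model, df ):
    # Compute the model's F1 score on any dataset with the same columns.
    X = df.drop( columns=['outcome'] )
    return f1_score( df['outcome'], model.predict( X ) )

# One way to make a random 80/20 training/validation split:
df = pd.read_csv( 'cleaned-mortgage-training-data.csv' )
df_training   = df.sample( frac=0.8, random_state=1 )   # pandas uses NumPy's RNG here
df_validation = df.drop( df_training.index )

model = fit_model_to( df_training )
F1 = score_model( model, df_validation )
print( F1 )
```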
Question 4. We can use the training/validation split to check whether a model is overfitting the training data. Do you have any evidence that there is overfitting in your case?
Question 5. If there were evidence of overfitting, how might we simplify the model to reduce that problem?
We will take our mid-class break here.
Question 6. How can we measure which variables are the most/least important in a model?
Question 7. What does this have to do with overfitting or underfitting?
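One common starting point (not the only one) is to compare the sizes of the fitted coefficients after putting the predictors on a common scale; libraries such as statsmodels can also report p-values. Here is a sketch, building on the hypothetical fit_model_to() function above.

```python
# Sketch: compare standardized coefficients as a rough measure of each predictor's
# importance.  Builds on the hypothetical fit_model_to() above; assumes all
# predictors are numeric after cleaning.
import pandas as pd

X = df_training.drop( columns=['outcome'] )
standardized = ( X - X.mean() ) / X.std()            # put predictors on a common scale
standardized['outcome'] = df_training['outcome']

model = fit_model_to( standardized )
coefficients = pd.Series( model.coef_[0], index=X.columns )
print( coefficients.abs().sort_values( ascending=False ) )   # larger = more influential
```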
Take this step and the ones on the next slide in breakout groups now:
Update your fit_model_to() function so that it can fit a model to just a chosen subset of the predictor columns, passed in as an additional argument.
Create a variable, columns, that contains a list of all the columns except the one you want to omit. (You might choose to use their indices instead of their names, to save some typing.)
Then call your fit_model_to() and score_model() functions again, this time using only your chosen subset of the predictors.
Question 8. Which predictor did you choose to omit and why?
Question 9. Did your new model seem to generalize better to unseen data than the original did?
Question 10. Our results depend on the particular random training/validation splits that NumPy selected for us. That should make us wonder whether they are reliable or just coincidental. How might we address that concern?
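One possible remedy, sketched under the same assumptions as the earlier code, is to repeat the random split several times and look at the average and spread of the scores; cross-validation is a more systematic version of the same idea.

```python
# Sketch: repeat the random training/validation split several times, so that no single
# lucky or unlucky split drives the conclusion.  Same assumptions as the earlier sketches.
import numpy as np

scores = []
for seed in range( 10 ):
    part_training   = df.sample( frac=0.8, random_state=seed )
    part_validation = df.drop( part_training.index )
    scores.append( score_model( fit_model_to( part_training ), part_validation ) )

print( np.mean( scores ), np.std( scores ) )   # typical score and how much it varies
```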
Experiment with various subsets of the predictor columns until you find a subset that generalizes well to unseen data. It could contain all the predictors, just one of them, or anything in between.
In actual modeling work, there are principled ways to go about choosing a good subset of the predictors. In this small preview of machine learning, we will do it just by experimentation. See MA252 or MA347 for more principled methods.
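As a concrete illustration of that experimentation, here is a sketch in which fit_model_to() and score_model() accept an optional list of predictor columns, and we try omitting each predictor in turn. It relies on the same assumptions (scikit-learn, a hypothetical outcome column) as the earlier sketches.

```python
# Sketch: let the two functions accept an optional list of predictor columns, then try
# omitting each predictor in turn and compare validation F1 scores.  Same assumptions
# (scikit-learn, hypothetical outcome column) as the earlier sketches.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_model_to ( df, columns=None ):
    # If columns is given, fit using only that subset of predictors.
    X = df.drop( columns=['outcome'] ) if columns is None else df[columns]
    return LogisticRegression( max_iter=1000 ).fit( X, df['outcome'] )

def score_model ( model, df, columns=None ):
    X = df.drop( columns=['outcome'] ) if columns is None else df[columns]
    return f1_score( df['outcome'], model.predict( X ) )

all_predictors = [ col for col in df_training.columns if col != 'outcome' ]
for omitted in all_predictors:
    columns = [ col for col in all_predictors if col != omitted ]
    model = fit_model_to( df_training, columns )
    print( omitted, score_model( model, df_validation, columns ) )
```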
Adapt your data-cleaning code so that it reads mortgage-testing-data.csv, cleans it, then writes cleaned-mortgage-testing-data.csv. (Take care not to save it over top of your training data!)
Use score_model() to apply your already-chosen model to the test data you just received. (Do NOT train your model on the testing data! You're checking to see how well the model you already trained on known data generalizes to unseen data.)
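Here is a sketch of what that final step might look like. It assumes the functions and column names from the earlier sketches, plus a hypothetical clean_mortgage_data() helper and a hypothetical list of chosen predictors, each standing in for whatever you actually built.

```python
# Sketch of the test phase: clean the new file the same way you cleaned the training
# data, save the result, then score the ALREADY-trained model on it.  The helper
# clean_mortgage_data() and the column names below are hypothetical placeholders.
import pandas as pd

df_testing = clean_mortgage_data( pd.read_csv( 'mortgage-testing-data.csv' ) )
df_testing.to_csv( 'cleaned-mortgage-testing-data.csv', index=False )  # a new file, not the training one

chosen_columns = [ 'predictor1', 'predictor2' ]               # hypothetical: whatever subset you settled on
final_model = fit_model_to( df_training, chosen_columns )     # trained on training data ONLY
print( score_model( final_model, df_testing, chosen_columns ) )   # F1 on truly unseen data
```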