Today's Exercise

We will fit a logistic regression model to a sample of data from the mortgage dataset we've used throughout the semester.

From millions of mortgages, you have a sample of 5000. Let's pretend these are the mortgage applications from the lender for which you work.

The primary questions we'll answer with our model are these:

  1. Can we use a computer model to give a reasonable prediction of whether a mortgage application will be approved just based on a few variables? (If so, the lender can create an online system where customers can type in some details and get a "likely to be approved" or "not likely to be approved" response immediately.)
  2. Is sex a major contributing variable in the model? (If so, the lender wants to review its practices to address any discrimination.)

Competition

To make things fun, we will run today like a Kaggle competition.

  • You've been given the training data in advance, and can create your model using any variables you wish within that data.
  • Near the end of class, I'll share the test data, and you can score your model on that.

We will judge models using the $F_1$ score introduced in this section of the course notes.
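Recall that $F_1$ is the harmonic mean of precision and recall: $F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$.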

Discussion

Question 1. Why do we split a dataset into train/test data parts?

Question 2. Isn't it better to fit a model on all the data you have?

Breakout groups - phase 1

Each student should work on their own model, so that you can master the concepts yourself, rather than just watch someone else. But of course everyone in a group is encouraged to discuss and help one another as much as possible!

Take these steps in breakout groups now:

  1. Load your cleaned dataset into a new Jupyter notebook or Python script.
  2. Load scikit-learn's logistic regression tools as discussed in this section of the course notes and use them to fit a model to your training data. Use all variables in this model.
  3. Write formulas that compute the precision, recall, and $F_1$ score for your model.
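If you'd like a starting point, here is a minimal sketch of those three steps. It assumes the cleaned file is named cleaned-mortgage-training-data.csv and that the target column is called approved, coded 0/1; substitute whatever names and encoding your cleaned dataset actually uses.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Step 1: load the cleaned training data. (The file name is an assumption.)
    df_training = pd.read_csv( 'cleaned-mortgage-training-data.csv' )

    # Step 2: fit a logistic regression using all predictors.
    # (The target column name 'approved' is an assumption.)
    predictors = df_training.drop( columns='approved' )
    target = df_training['approved']
    model = LogisticRegression( max_iter=1000 )
    model.fit( predictors, target )

    # Step 3: compute precision, recall, and F1 directly from their formulas.
    predicted = model.predict( predictors )
    true_pos  = ( ( predicted == 1 ) & ( target == 1 ) ).sum()
    false_pos = ( ( predicted == 1 ) & ( target == 0 ) ).sum()
    false_neg = ( ( predicted == 0 ) & ( target == 1 ) ).sum()
    precision = true_pos / ( true_pos + false_pos )
    recall    = true_pos / ( true_pos + false_neg )
    F1 = 2 * precision * recall / ( precision + recall )
    print( precision, recall, F1 )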

Discussion

Check: You should have found the following values, based on the training data I gave you.

  • $\text{precision} \approx 0.6455$
  • $\text{recall} \approx 0.6496$
  • $F_1 \approx 0.6475$

Question 3. None of these measurements say how well the model performs on unseen data. We will evaluate that in the test phase, at the end of our competition. But what if you wanted, right now, to test whether the model works well on unseen data? What might you do?

Breakout groups - phase 2

Take these steps in breakout groups now:

  1. Split your training data into two parts, 80% and 20%. But instead of calling them training and testing data, call them training and validation data. (I still haven't given you the test data.)
  2. Train the model on just the (new, smaller) training data, and compute its $F_1$ scores on both the (new, smaller) training data and the validation data, separately.
    • I got $F_1\approx0.698$ and $F_1\approx0.677$, but this will vary based on your random training/validation split.
  3. Abstract all of your work into two functions, so that you can do model fitting and scoring with two function calls, model = fit_model_to( df_training ) and F1 = score_model( model, df_validation ).
    • If you use this on the same training/validation split, you should get the same $F_1$ scores as you did before creating the functions.
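If it helps, here is one possible shape for that work. It is only a sketch, reusing the assumed approved column name from the earlier sketch and using scikit-learn's train_test_split for the 80%/20% split.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def fit_model_to( df_training ):
        # Fit a logistic regression to all predictors in the given DataFrame.
        predictors = df_training.drop( columns='approved' )   # assumed target name
        target = df_training['approved']
        model = LogisticRegression( max_iter=1000 )
        model.fit( predictors, target )
        return model

    def score_model( model, df_validation ):
        # Compute the model's F1 score on the given DataFrame.
        predictors = df_validation.drop( columns='approved' )
        target = df_validation['approved']
        predicted = model.predict( predictors )
        true_pos  = ( ( predicted == 1 ) & ( target == 1 ) ).sum()
        false_pos = ( ( predicted == 1 ) & ( target == 0 ) ).sum()
        false_neg = ( ( predicted == 0 ) & ( target == 1 ) ).sum()
        precision = true_pos / ( true_pos + false_pos )
        recall    = true_pos / ( true_pos + false_neg )
        return 2 * precision * recall / ( precision + recall )

    # Step 1: split the original training data 80%/20% into training and validation parts.
    df_training, df_validation = train_test_split( df_training, train_size=0.8 )

    # Step 2: train on the smaller training data and score both parts separately.
    model = fit_model_to( df_training )
    print( score_model( model, df_training ) )     # F1 on the smaller training data
    print( score_model( model, df_validation ) )   # F1 on the validation data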

Discussion

Question 4. We can use the training/validation split to check whether a model is overfitting the training data. Do you have any evidence that there is overfitting in your case?

Question 5. If there were evidence of overfitting, how might we simplify the model to reduce that problem?

Break

We will take our mid-class break here.

Discussion

Question 6. How can we measure which variables are the most/least important in a model?

Question 7. What does this have to do with overfitting or underfitting?

Breakout groups - phase 3

Take this step and the ones on the next slide in breakout groups now:

  1. Improve your fit_model_to() function so that, in addition to fitting a model to the provided predictors, it does these two things:
    • Fit a separate model to the standardized predictors, so that we can look at that second model's coefficients.
    • Print out the coefficients for that second model. (You will still return the original model; this second model is only for printing standardized coefficients.)
    • Optional bonus: Can you print the coefficients in decreasing order of absolute value?
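Here is one way the improved function might look, again only a sketch under the same assumptions as before, using scikit-learn's StandardScaler to standardize the predictors.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    def fit_model_to( df_training ):
        predictors = df_training.drop( columns='approved' )   # assumed target name
        target = df_training['approved']
        # The model we actually return, fit to the original predictors:
        model = LogisticRegression( max_iter=1000 )
        model.fit( predictors, target )
        # A second model, fit to standardized predictors, used only so that its
        # coefficients' sizes can be meaningfully compared and printed:
        standardized = StandardScaler().fit_transform( predictors )
        std_model = LogisticRegression( max_iter=1000 )
        std_model.fit( standardized, target )
        coefficients = pd.Series( std_model.coef_[0], index=predictors.columns )
        # Optional bonus: print in decreasing order of absolute value.
        print( coefficients.reindex( coefficients.abs().sort_values( ascending=False ).index ) )
        return model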

Breakout groups - phase 3, continued

  2. Choose one predictor to omit from the model, so that we can test whether the model will perform better on a subset of the features.
  3. Declare a variable columns that contains a list of all the columns except the one you want to omit. (You might choose to use their indices instead of their names, to save some typing.)
    • Be sure to include the target variable in the list of columns you want to keep!
  4. Run your fit_model_to() and score_model() functions again, this time using only your chosen subset of the predictors, as sketched below.
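For instance, if you decided to omit a hypothetical predictor called loan_term, those steps might look like this. (The column name is just a placeholder; use whichever predictor you chose.)

    # Keep every column except the one we chose to omit (including the target).
    columns = [ col for col in df_training.columns if col != 'loan_term' ]

    model = fit_model_to( df_training[columns] )
    F1 = score_model( model, df_validation[columns] )
    print( F1 )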

Discussion

Question 8. Which predictor did you choose to omit and why?

Question 9. Did your new model seem to generalize better to unseen data than the original did?

Question 10. Our results depend on the particular random training/validation splits that NumPy selected for us. That should make us wonder whether they are reliable or just coincidental. How might we address that concern?

Breakout groups - phase 4, preparing for the competition

Experiment with various subsets of the predictor columns until you find a subset that generalizes well to unseen data. It could contain all the predictors, just one of them, or anything in between.
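One informal way to do that experimentation, sketched below with placeholder column names you would replace with your own, is to loop over a few candidate subsets and compare their validation $F_1$ scores.

    # Each subset must include the target column ('approved' is an assumption);
    # the other column names here are placeholders for your own choices.
    candidate_subsets = [
        list( df_training.columns ),                       # all predictors
        [ 'income', 'loan_amount', 'approved' ],
        [ 'income', 'loan_amount', 'sex', 'approved' ],
    ]

    for columns in candidate_subsets:
        model = fit_model_to( df_training[columns] )
        print( columns, score_model( model, df_validation[columns] ) )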

In actual modeling work, there are principled ways to go about choosing a good subset of the predictors. In this small preview of machine learning, we will do it just by experimentation. See MA252 or MA347 for more principled methods.

(Friendly) Competition

  • The set of columns you've chosen defines the model you will enter into the (friendly, no-prize) competition that happens now.
  • I will give you the testing data, and we will judge each model based on its $F_1$ score on the testing data.
  • Honor system: From this point forward, don't change the set of predictor columns you've chosen to use.

How to enter the competition

  1. Open the Jupyter notebook or Python script you used when doing the preparatory homework for today.
  2. Change the filenames in your code so that it reads mortgage-testing-data.csv, cleans it, then writes cleaned-mortgage-testing-data.csv. (Take care not to overwrite your cleaned training data!)
  3. After the end of the code you wrote today, add code that loads the cleaned testing data as a new DataFrame.
  4. Run score_model() to apply your already-chosen model to the test data you just received.
  5. Note the $F_1$ score it shows as output. This score is your entry in the competition.
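Steps 3 through 5 might look something like the sketch below, assuming the model and columns variables for your chosen subset are still defined from your earlier work.

    # Load the cleaned testing data written in step 2, then score the
    # already-trained model on it. Do NOT refit the model here.
    df_testing = pd.read_csv( 'cleaned-mortgage-testing-data.csv' )
    print( score_model( model, df_testing[columns] ) )   # your competition F1 score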

(Do NOT train your model on the testing data! You're checking to see how well the model you already trained on known data generalizes to unseen data.)