Interpreting Single-Variable vs. Multiple Logistic Regression

In this post, I am going to show how the sign of a coefficient can change as we add more features to a logistic regression model. I am also going to explain why the sign flips.

To start, I am loading the Default dataset.

import pandas as pd 

df = pd.read_csv('https://raw.githubusercontent.com/sik-flow/datasets/master/Default.csv')
df.head()
   default  student      balance        income
0        0        0   729.526495  44361.625074
1        0        1   817.180407  12106.134700
2        0        0  1073.549164  31767.138947
3        0        0   529.250605  35704.493935
4        0        0   785.655883  38463.495879

I am first going to build a model using student (a categorical feature) to predict whether someone defaults on a loan or not.

import statsmodels.api as sm

X = df[['student']]
y = df['default']

X = sm.add_constant(X)

logit = sm.Logit(y, X).fit()
logit.summary()
Optimization terminated successfully.
         Current function value: 0.145434
         Iterations 7
Logit Regression Results
Dep. Variable:    default            No. Observations:  10000
Model:            Logit              Df Residuals:      9998
Method:           MLE                Df Model:          1
Date:             Sat, 01 Aug 2020   Pseudo R-squ.:     0.004097
Time:             21:12:23           Log-Likelihood:    -1454.3
converged:        True               LL-Null:           -1460.3
Covariance Type:  nonrobust          LLR p-value:       0.0005416

            coef  std err        z  P>|z|  [0.025  0.975]
const    -3.5041    0.071  -49.554  0.000  -3.643  -3.366
student   0.4049    0.115    3.520  0.000   0.179   0.630

The coefficient for student is positive, which means that a student has a higher probability of defaulting than a non-student. I can check this with the following.

import numpy as np

# probability of default for a student
np.exp(-3.5041 + 0.4049) / (1 + np.exp(-3.5041 + 0.4049))
0.04314026622102699
# probability of default for a non-student
np.exp(-3.5041) / (1 + np.exp(-3.5041))
0.029195798210381152

From our logistic regression model, the probability of default is 4.3% for a student and 2.9% for a non-student. This should line up with the default rates in the original data. We can check this with the following:

# student 
df[df['student'] == 1]['default'].value_counts(normalize = True)
0    0.956861
1    0.043139
Name: default, dtype: float64
# non-student 
df[df['student'] == 0]['default'].value_counts(normalize = True)
0    0.970805
1    0.029195
Name: default, dtype: float64
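
These rates should match the model for a structural reason: with a single binary feature, the model has one parameter per group, so the maximum-likelihood fit reproduces each group's observed default rate, and the coefficients are the log-odds of those rates. Here is a minimal sketch of that check, using the rates printed above and only the standard library. The `logit` helper is a name introduced here for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

# observed default rates from the value_counts output above
rate_student = 0.043139
rate_non_student = 0.029195

# intercept = log-odds of default for non-students (student = 0)
const = logit(rate_non_student)
# student coefficient = difference in log-odds between the two groups
student_coef = logit(rate_student) - logit(rate_non_student)

print(round(const, 4), round(student_coef, 4))
```

The two printed values should come out close to the const and student coefficients in the summary table (-3.5041 and 0.4049).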

We see that it does line up with our logistic regression model. Now let's see what happens when I add an additional feature, balance, to the model.

X = df[['student', 'balance']]
y = df['default']

X = sm.add_constant(X)

logit = sm.Logit(y, X).fit()
logit.summary()
Optimization terminated successfully.
         Current function value: 0.078584
         Iterations 10
Logit Regression Results
Dep. Variable:    default            No. Observations:  10000
Model:            Logit              Df Residuals:      9997
Method:           MLE                Df Model:          2
Date:             Sat, 01 Aug 2020   Pseudo R-squ.:     0.4619
Time:             21:12:23           Log-Likelihood:    -785.84
converged:        True               LL-Null:           -1460.3
Covariance Type:  nonrobust          LLR p-value:       1.189e-293

             coef  std err        z  P>|z|   [0.025   0.975]
const    -10.7495    0.369  -29.115  0.000  -11.473  -10.026
student   -0.7149    0.148   -4.846  0.000   -1.004   -0.426
balance    0.0057    0.000   24.748  0.000    0.005    0.006


We now see that student has a negative coefficient, which means that when balance is held constant, students default at a lower rate than non-students. This contradicts what our single-variable logistic regression model says. Let's take a look at why this happens.
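
Before looking at the why, it may help to put a size on the two student coefficients. Exponentiating a logistic regression coefficient gives an odds ratio, so the sketch below does that for both models. The coefficient values are copied from the summary tables above, and the variable names are introduced here for illustration.

```python
import math

# coefficients copied from the two summary tables above
coef_student_alone = 0.4049          # model with student only
coef_student_with_balance = -0.7149  # model with student and balance

# exponentiating a logistic regression coefficient gives an odds ratio
odds_ratio_alone = math.exp(coef_student_alone)
odds_ratio_with_balance = math.exp(coef_student_with_balance)

print(f"student only:            {odds_ratio_alone:.2f}")
print(f"student, balance fixed:  {odds_ratio_with_balance:.2f}")
```

A ratio above 1 means higher odds of default for students and a ratio below 1 means lower odds. The first value should land near 1.5 and the second near 0.49, so the direction of the effect reverses once balance is in the model.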

I am first going to build a plot that shows the rate at which students and non-students default at different balances.

import seaborn as sns 
import matplotlib.pyplot as plt
# bin edges: 0, 500, 600, ..., 2400, 3000
my_vals = [0] + [i for i in range(500, 2500, 100)] + [3000]
student = []
non_student = []
# default rate within each (lower, upper) balance bin, split by student status
for lower, upper in zip(my_vals[:-1], my_vals[1:]):
    in_bin = (df['balance'] > lower) & (df['balance'] < upper)
    stud_val = df[(df['student'] == 1) & in_bin]['default']
    student.append((stud_val == 1).sum() / len(stud_val))
    nstud_val = df[(df['student'] == 0) & in_bin]['default']
    non_student.append((nstud_val == 1).sum() / len(nstud_val))
plt.figure(figsize = (12, 8))
plt.plot(range(500, 2600, 100), non_student, color = 'red', label = 'non-student')
plt.plot(range(500, 2600, 100), student, color = 'blue', label = 'student')
plt.axhline(df[df['student'] == 0]['default'].value_counts(normalize = True)[1], 
           color = 'red', ls = '--')
plt.axhline(df[df['student'] == 1]['default'].value_counts(normalize = True)[1], 
           color = 'blue', ls = '--')
plt.legend(fontsize = 16)
plt.xlabel('Credit Card Balance', fontsize = 18)
plt.ylabel('Default Rate', fontsize = 18)
plt.xticks(fontsize = 16)
plt.yticks(fontsize = 16);

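[Figure: default rate vs. credit card balance for students (blue) and non-students (red); dashed horizontal lines mark each group's overall default rate]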

The horizontal dashed lines are the overall default rates for students (blue) and non-students (red). The blue dashed line is above the red dashed line, which is why the single-variable logistic regression showed that students were more likely to default than non-students. The line plot shows that at the same credit card balance, students are less likely to default than non-students.

Additionally, I want to look at the balance distribution for students and non-students.

plt.figure(figsize = (12, 8))
sns.boxplot(x = 'student', y = 'balance', data = df, palette =  ['r', 'b'])
plt.xticks(range(2), ['non-student', 'student'], fontsize = 16)
plt.xlabel('')
plt.ylabel('Credit Card Balance', fontsize = 18)
plt.yticks(fontsize = 16);

[Figure: box plots of credit card balance for non-students (red) and students (blue)]

We see that students tend to carry a higher balance than non-students. Because a higher balance goes with a higher default rate, students default more often overall. However, when a student and a non-student have the same balance, the student is less likely to default.
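
The way a group can have the lower rate at every balance level yet the higher rate overall can be sketched with a small counting example. The counts below are hypothetical, invented for illustration and not taken from the Default data. The `rate` helper and the two balance bands are likewise assumptions of this sketch.

```python
# hypothetical counts, invented for illustration (NOT taken from the Default data)
# (group, balance band) -> (number of people, number of defaults)
counts = {
    ("student", "low"): (200, 2),
    ("student", "high"): (800, 80),
    ("non-student", "low"): (800, 16),
    ("non-student", "high"): (200, 30),
}

def rate(group, band=None):
    """Default rate for a group, within one balance band or overall."""
    rows = [v for (g, b), v in counts.items()
            if g == group and (band is None or b == band)]
    people = sum(n for n, _ in rows)
    defaults = sum(d for _, d in rows)
    return defaults / people

# within each band, students default less often than non-students
for band in ("low", "high"):
    print(band, rate("student", band), rate("non-student", band))

# overall, students default more often because most of them are in the high band
print("overall", rate("student"), rate("non-student"))
```

In this made-up table, students have the lower rate in both bands, but most students sit in the high-balance band where defaults are common, so their overall rate ends up higher. This is the same pattern the two plots show in the real data.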

Let's look at the probability of default for a student and a non-student when they both have a $2000 balance.

# student
np.exp(-10.75 + 0.0057 * 2000 - 0.7149) / (1 + np.exp(-10.75 + 0.0057 * 2000 - 0.7149))
0.483780692590808
# non-student
np.exp(-10.75 + 0.0057 * 2000) / (1 + np.exp(-10.75 + 0.0057 * 2000))
0.6570104626734988

At a $2000 balance, a student has a 48% probability of defaulting and a non-student has a 66% probability.
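
The same calculation can be repeated at other balances to check that the gap always points the same way. Below is a minimal sketch using only the standard library. The coefficients are copied from the two-feature summary table, while the `default_probability` helper and the list of balances are names and values chosen here for illustration.

```python
import math

# coefficients copied from the two-feature summary table above
CONST = -10.7495
COEF_STUDENT = -0.7149
COEF_BALANCE = 0.0057

def default_probability(balance, is_student):
    """Predicted probability of default from the fitted log-odds."""
    log_odds = CONST + COEF_STUDENT * is_student + COEF_BALANCE * balance
    return 1 / (1 + math.exp(-log_odds))

# illustrative balances, chosen for this sketch
balances = [500, 1000, 1500, 2000, 2500]
for balance in balances:
    p_student = default_probability(balance, 1)
    p_non_student = default_probability(balance, 0)
    print(f"{balance:>5}  student={p_student:.3f}  non-student={p_non_student:.3f}")
```

Because the student coefficient is negative and there is no interaction term in this model, the student probability is below the non-student probability at every balance.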
