Interpreting a Linear Regression Model in Python

A demonstration of EZInterpret.

Start by loading the necessary packages and the dataset.

import pandas as pd
from statsmodels.formula.api import ols
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import ezinterpret

df = pd.read_csv('https://raw.githubusercontent.com/sik-flow/datasets/master/bike.csv')

Next, do some light data cleaning and feature engineering.

# One-hot encode season, leaving one season as the baseline category
df['summer'] = df['season'].map(lambda x: 1 if x == 3 else 0)
df['fall'] = df['season'].map(lambda x: 1 if x == 4 else 0)
df['winter'] = df['season'].map(lambda x: 1 if x == 1 else 0)

# One-hot encode the weather situation, with clear weather (1) as the baseline
df['misty'] = df['weathersit'].map(lambda x: 1 if x == 2 else 0)
df['rain_snow_storm'] = df['weathersit'].map(lambda x: 1 if x > 2 else 0)

# Convert the date column and count days elapsed since the first observation
df['dteday'] = pd.to_datetime(df['dteday'])

df['days_since_2011'] = df['dteday'].map(lambda x: (x - df.loc[0, 'dteday']).days)

# Undo the dataset's normalization so the features are in physical units:
# temp back to degrees Celsius (-8 to 39), windspeed and humidity unscaled
df['temp'] = df['temp'] * (39 - (-8)) + (-8)
df['windspeed'] = df['windspeed'] * 67
df['hum'] = df['hum'] * 100
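
As a quick sanity check (my addition, not part of the original walkthrough), the rescaled columns should now span their physical ranges:

# temp should run roughly -8 to 39 C, hum 0 to 100, windspeed back on its original scale
print(df[['temp', 'hum', 'windspeed']].agg(['min', 'max']))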

Build the linear regression model.

formula = "cnt ~ summer + fall + winter + holiday + workingday + misty + rain_snow_storm + temp + hum + windspeed + days_since_2011"
model = ols(formula=formula, data=df).fit()

model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    cnt   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.790
Method:                 Least Squares   F-statistic:                     251.2
Date:                Mon, 10 Aug 2020   Prob (F-statistic):          1.05e-237
Time:                        21:01:57   Log-Likelihood:                -5993.0
No. Observations:                 731   AIC:                         1.201e+04
Df Residuals:                     719   BIC:                         1.207e+04
Df Model:                          11
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        3298.7604    262.145     12.584      0.000    2784.099    3813.422
summer           -761.1027    107.169     -7.102      0.000    -971.505    -550.701
fall             -473.7153    109.947     -4.309      0.000    -689.570    -257.860
winter           -899.3182    122.283     -7.354      0.000   -1139.393    -659.243
holiday          -686.1154    203.301     -3.375      0.001   -1085.251    -286.980
workingday        124.9209     73.267      1.705      0.089     -18.921     268.763
misty            -379.3985     87.553     -4.333      0.000    -551.289    -207.508
rain_snow_storm -1901.5399    223.640     -8.503      0.000   -2340.605   -1462.475
temp              110.7096      7.043     15.718      0.000      96.882     124.537
hum               -17.3772      3.169     -5.483      0.000     -23.600     -11.155
windspeed         -42.5135      6.892     -6.169      0.000     -56.044     -28.983
days_since_2011     4.9264      0.173     28.507      0.000       4.587       5.266
==============================================================================
Omnibus:                       91.525   Durbin-Watson:                   0.911
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              194.706
Skew:                          -0.719   Prob(JB):                     5.25e-43
Kurtosis:                       5.079   Cond. No.                     3.74e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.74e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
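
That warning is worth a quick look. One way to probe it (my addition, not part of EZInterpret) is to compute variance inflation factors for the terms in the fitted design matrix:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each term in the design matrix (includes the intercept column)
exog = model.model.exog
for i, name in enumerate(model.model.exog_names):
    print(f'{name}: {variance_inflation_factor(exog, i):.2f}')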

Make an instance of ezinterpret's ez_linear class and pass in the fitted model.

ins = ezinterpret.ez_linear(model)

First I am going to look at feature importance, which is based on the t-statistic.

ins.feature_importance();

[Figure: feature importance plot]
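
Since the importance metric is the t-statistic, a roughly equivalent plot can be built straight from the statsmodels results. This is a sketch of the idea, not EZInterpret's internals; I am assuming the absolute value of t is what gets ranked:

# Rank features by the absolute t-statistic, excluding the intercept
importance = model.tvalues.drop('Intercept').abs().sort_values()
importance.plot.barh(title='Feature importance (|t|)')
plt.xlabel('|t-statistic|')
plt.show()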

Next I am going to look at the weight plot, which is based on the coefficients.

ins.weight_plot();

[Figure: weight plot]
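
A comparable weight plot can be sketched straight from the fitted model; I am assuming the plot shows point estimates with 95% confidence intervals:

# Coefficients with 95% confidence intervals, intercept excluded
coefs = model.params.drop('Intercept')
ci = model.conf_int().drop('Intercept')
ypos = np.arange(len(coefs))
plt.errorbar(coefs, ypos, xerr=(ci[1] - ci[0]) / 2, fmt='o')
plt.yticks(ypos, coefs.index)
plt.axvline(0, color='grey', linestyle='--')
plt.xlabel('Weight estimate')
plt.show()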

We see that days_since_2011 has a coefficient near 0 even though it is the most important feature. That is because the raw feature takes large values (0 to 730 days), so even a tiny per-day weight adds up to a large contribution to the prediction. A better way to see this is an effect plot, which shows the distribution of each feature's raw values multiplied by its coefficient.
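
The computation behind an effect plot is simple enough to sketch by hand for the continuous features (my own illustration, not EZInterpret's code):

# Effect of a feature on the prediction = observed value * fitted weight
cont = ['temp', 'hum', 'windspeed', 'days_since_2011']
effects = df[cont] * model.params[cont]
effects.boxplot(vert=False)
plt.xlabel('Effect on predicted rental count')
plt.show()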

To run this with EZInterpret, I need to pass in a dictionary mapping each original categorical feature to its one-hot encoded columns.

cats = {'weather': ['misty', 'rain_snow_storm'], 'season': ['summer', 'fall', 'winter']}
ins.effect_plot(df, cats);

[Figure: effect plot]

Now we see the large range of effects that days_since_2011 and temp have on the number of bikes rented, which makes it clear why they are the two most important features.

Finally, I am going to overlay a local prediction on the effect plot.

# Use row 302 as the local case to explain
test_case = df.loc[302].to_dict()
ins.effect_plot_with_local_pred(df, cats, test_case, 'cnt');

[Figure: effect plot with local prediction overlay]

Here the model predicts 3,501 bike rentals when there were actually 3,331. The red x’s mark where this specific case's values fall relative to the rest of the data.
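
As a final check (my addition), the headline prediction can be reproduced directly from the fitted model:

# Predict for the single row used as the test case; should print ~3501 vs. 3331
pred = model.predict(df.loc[[302]])
print(round(pred.iloc[0]), df.loc[302, 'cnt'])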