Interpreting a Linear Regression Model in Python
A demonstration of EZInterpret.
Start by loading the necessary packages and the dataset.
```python
import pandas as pd
from statsmodels.formula.api import ols
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ezinterpret

# load the bike sharing dataset
df = pd.read_csv('https://raw.githubusercontent.com/sik-flow/datasets/master/bike.csv')
```
Next, do some light data cleaning and feature engineering.
```python
# one-hot encode three of the four seasons (season 2 is the baseline)
df['summer'] = df['season'].map(lambda x: 1 if x == 3 else 0)
df['fall'] = df['season'].map(lambda x: 1 if x == 4 else 0)
df['winter'] = df['season'].map(lambda x: 1 if x == 1 else 0)

# one-hot encode the weather situation (weathersit 1, clear weather, is the baseline)
df['misty'] = df['weathersit'].map(lambda x: 1 if x == 2 else 0)
df['rain_snow_storm'] = df['weathersit'].map(lambda x: 1 if x > 2 else 0)

# days elapsed since the first date in the dataset
df['dteday'] = pd.to_datetime(df['dteday'])
df['days_since_2011'] = df['dteday'].map(lambda x: (x - df.loc[0, 'dteday']).days)

# undo the dataset's min-max scaling: temp back to degrees Celsius,
# windspeed and humidity back to their original units
df['temp'] = df['temp'] * (39 - (-8)) + (-8)
df['windspeed'] = 67 * df['windspeed']
df['hum'] = df['hum'] * 100
```
Build the linear regression model.
```python
formula = "cnt ~ summer+fall+winter+holiday+workingday+misty+rain_snow_storm+temp+hum+windspeed+days_since_2011"
model = ols(formula=formula, data=df).fit()
model.summary()
```
Dep. Variable: | cnt | R-squared: | 0.794 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.790 |
Method: | Least Squares | F-statistic: | 251.2 |
Date: | Mon, 10 Aug 2020 | Prob (F-statistic): | 1.05e-237 |
Time: | 21:01:57 | Log-Likelihood: | -5993.0 |
No. Observations: | 731 | AIC: | 1.201e+04 |
Df Residuals: | 719 | BIC: | 1.207e+04 |
Df Model: | 11 | | |
Covariance Type: | nonrobust | | |
term | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 3298.7604 | 262.145 | 12.584 | 0.000 | 2784.099 | 3813.422 |
summer | -761.1027 | 107.169 | -7.102 | 0.000 | -971.505 | -550.701 |
fall | -473.7153 | 109.947 | -4.309 | 0.000 | -689.570 | -257.860 |
winter | -899.3182 | 122.283 | -7.354 | 0.000 | -1139.393 | -659.243 |
holiday | -686.1154 | 203.301 | -3.375 | 0.001 | -1085.251 | -286.980 |
workingday | 124.9209 | 73.267 | 1.705 | 0.089 | -18.921 | 268.763 |
misty | -379.3985 | 87.553 | -4.333 | 0.000 | -551.289 | -207.508 |
rain_snow_storm | -1901.5399 | 223.640 | -8.503 | 0.000 | -2340.605 | -1462.475 |
temp | 110.7096 | 7.043 | 15.718 | 0.000 | 96.882 | 124.537 |
hum | -17.3772 | 3.169 | -5.483 | 0.000 | -23.600 | -11.155 |
windspeed | -42.5135 | 6.892 | -6.169 | 0.000 | -56.044 | -28.983 |
days_since_2011 | 4.9264 | 0.173 | 28.507 | 0.000 | 4.587 | 5.266 |
Omnibus: | 91.525 | Durbin-Watson: | 0.911 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 194.706 |
Skew: | -0.719 | Prob(JB): | 5.25e-43 |
Kurtosis: | 5.079 | Cond. No. | 3.74e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.74e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Create an instance of ezinterpret's `ez_linear` class and pass in the fitted model.
```python
ins = ezinterpret.ez_linear(model)
```
First, I am going to look at feature importance, which is based on the t-statistic.
```python
ins.feature_importance();
```
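For intuition, here is a minimal sketch of what a t-statistic-based importance plot can look like, computed directly from the fitted statsmodels results rather than through `ezinterpret`; the package's actual rendering may differ.

```python
# Sketch: rank features by the absolute value of their t-statistic
# (assumption: this mirrors the idea behind ez_linear.feature_importance).
importance = model.tvalues.drop('Intercept').abs().sort_values()
importance.plot.barh()
plt.xlabel('Absolute t-statistic')
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
```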
Next, I am going to look at the weight plot, which is based on the coefficients.
```python
ins.weight_plot();
```
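As a reference point, a weight plot can be reproduced straight from the model, assuming it is a dot-and-whisker chart of each coefficient with its 95% confidence interval; `ezinterpret`'s version may style this differently.

```python
# Sketch: each coefficient with its 95% confidence interval
# (an assumption about what ez_linear.weight_plot draws).
coefs = model.params.drop('Intercept')
ci = model.conf_int().drop('Intercept')
plt.errorbar(coefs, coefs.index, xerr=coefs - ci[0], fmt='o')
plt.axvline(0, color='grey', linestyle='--')  # zero-effect reference line
plt.xlabel('Weight estimate (95% CI)')
plt.tight_layout()
plt.show()
```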
We see that `days_since_2011` has a coefficient near 0 even though it is the most important feature. This is because the raw values of `days_since_2011` are large, so even a small coefficient translates into a large contribution to the prediction. A better way to look at it is with an effect plot, which shows the raw data multiplied by the coefficient. To run this, I need to pass in a dictionary grouping any one-hot encoded features.
```python
# group the one-hot encoded columns under their original features
cats = {'weather': ['misty', 'rain_snow_storm'], 'season': ['summer', 'fall', 'winter']}
ins.effect_plot(df, cats);
```
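The idea behind an effect plot is simple: each feature's effect on a given row is that row's raw value times the learned coefficient. Here is a minimal sketch built from the model directly; for brevity it leaves the one-hot columns separate rather than combining them with the `cats` grouping, which `ezinterpret` presumably handles.

```python
# Sketch: per-row effect = raw feature value * fitted coefficient
# (one-hot groups from `cats` are kept as separate columns here).
effects = pd.DataFrame({name: df[name] * coef
                        for name, coef in model.params.drop('Intercept').items()})
sns.boxplot(data=effects, orient='h')
plt.xlabel('Effect on predicted rental count')
plt.tight_layout()
plt.show()
```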
Now we see the large range of effects that `days_since_2011` and temperature have on the number of bikes rented, which makes it clear why these are the two most important features.
Finally, I am going to overlay a local prediction on the effect plot.
```python
# use a single observation as the local case
test_case = df.loc[302].to_dict()
ins.effect_plot_with_local_pred(df, cats, test_case, 'cnt');
```
Here our model predicts 3,501 bikes rented when there were actually 3,331. The red x's mark where the values for this specific case lie relative to the rest of the data.
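To sanity-check those numbers without `ezinterpret`, the local prediction can be pulled straight from the fitted model:

```python
# Sketch: reproduce the local prediction for row 302 from the model itself.
case = df.loc[[302]]                # keep it as a one-row DataFrame for predict()
print(model.predict(case).iloc[0])  # predicted count (~3,501 per the text above)
print(df.loc[302, 'cnt'])           # actual count (3,331 per the text above)
```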