Interpreting a Linear Regression Model in Python
A demonstration of EZInterpret.
Start by loading the necessary packages and the dataset.
```python
import pandas as pd
from statsmodels.formula.api import ols
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ezinterpret

# load the bike sharing dataset
df = pd.read_csv('https://raw.githubusercontent.com/sik-flow/datasets/master/bike.csv')
```
Next, do some light data cleaning and feature engineering.
```python
# one-hot encode three of the four seasons (season 2 is the baseline)
df['summer'] = df['season'].map(lambda x: 1 if x == 3 else 0)
df['fall'] = df['season'].map(lambda x: 1 if x == 4 else 0)
df['winter'] = df['season'].map(lambda x: 1 if x == 1 else 0)

# one-hot encode the weather situation (weathersit 1, clear weather, is the baseline)
df['misty'] = df['weathersit'].map(lambda x: 1 if x == 2 else 0)
df['rain_snow_storm'] = df['weathersit'].map(lambda x: 1 if x > 2 else 0)

# days elapsed since the first date in the dataset
df['dteday'] = pd.to_datetime(df['dteday'])
df['days_since_2011'] = df['dteday'].map(lambda x: (x - df.loc[0, 'dteday']).days)

# undo the dataset's min-max scaling: temp back to degrees Celsius,
# windspeed and humidity back to their original units
df['temp'] = df['temp'] * (39 - (-8)) + (-8)
df['windspeed'] = 67 * df['windspeed']
df['hum'] = df['hum'] * 100
```
Build the linear regression model.
```python
formula = "cnt ~ summer+fall+winter+holiday+workingday+misty+rain_snow_storm+temp+hum+windspeed+days_since_2011"
model = ols(formula=formula, data=df).fit()
model.summary()
```
Dep. Variable: | cnt | R-squared: | 0.794 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.790 |
Method: | Least Squares | F-statistic: | 251.2 |
Date: | Mon, 10 Aug 2020 | Prob (F-statistic): | 1.05e-237 |
Time: | 21:01:57 | Log-Likelihood: | -5993.0 |
No. Observations: | 731 | AIC: | 1.201e+04 |
Df Residuals: | 719 | BIC: | 1.207e+04 |
Df Model: | 11 | | |
Covariance Type: | nonrobust | | |
term | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 3298.7604 | 262.145 | 12.584 | 0.000 | 2784.099 | 3813.422 |
summer | -761.1027 | 107.169 | -7.102 | 0.000 | -971.505 | -550.701 |
fall | -473.7153 | 109.947 | -4.309 | 0.000 | -689.570 | -257.860 |
winter | -899.3182 | 122.283 | -7.354 | 0.000 | -1139.393 | -659.243 |
holiday | -686.1154 | 203.301 | -3.375 | 0.001 | -1085.251 | -286.980 |
workingday | 124.9209 | 73.267 | 1.705 | 0.089 | -18.921 | 268.763 |
misty | -379.3985 | 87.553 | -4.333 | 0.000 | -551.289 | -207.508 |
rain_snow_storm | -1901.5399 | 223.640 | -8.503 | 0.000 | -2340.605 | -1462.475 |
temp | 110.7096 | 7.043 | 15.718 | 0.000 | 96.882 | 124.537 |
hum | -17.3772 | 3.169 | -5.483 | 0.000 | -23.600 | -11.155 |
windspeed | -42.5135 | 6.892 | -6.169 | 0.000 | -56.044 | -28.983 |
days_since_2011 | 4.9264 | 0.173 | 28.507 | 0.000 | 4.587 | 5.266 |
Omnibus: | 91.525 | Durbin-Watson: | 0.911 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 194.706 |
Skew: | -0.719 | Prob(JB): | 5.25e-43 |
Kurtosis: | 5.079 | Cond. No. | 3.74e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.74e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Create an instance of ezinterpret's `ez_linear` class and pass in the fitted model.
```python
ins = ezinterpret.ez_linear(model)
```
First, I am going to look at feature importance, which is based on the t-statistic.
```python
ins.feature_importance();
```
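For intuition, here is a minimal sketch of what a t-statistic-based importance plot can look like, computed directly from the fitted statsmodels results rather than through `ezinterpret`; the package's actual rendering may differ.

```python
# Sketch: rank features by the absolute value of their t-statistic
# (assumption: this mirrors the idea behind ez_linear.feature_importance).
importance = model.tvalues.drop('Intercept').abs().sort_values()
importance.plot.barh()
plt.xlabel('Absolute t-statistic')
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
```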
Next, I am going to look at the weight plot, which is based on the coefficients.
```python
ins.weight_plot();
```
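As a reference point, a weight plot can be reproduced straight from the model, assuming it is a dot-and-whisker chart of each coefficient with its 95% confidence interval; `ezinterpret`'s version may style this differently.

```python
# Sketch: each coefficient with its 95% confidence interval
# (an assumption about what ez_linear.weight_plot draws).
coefs = model.params.drop('Intercept')
ci = model.conf_int().drop('Intercept')
plt.errorbar(coefs, coefs.index, xerr=coefs - ci[0], fmt='o')
plt.axvline(0, color='grey', linestyle='--')  # zero-effect reference line
plt.xlabel('Weight estimate (95% CI)')
plt.tight_layout()
plt.show()
```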
We see that `days_since_2011` has a coefficient near 0 even though it is the most important feature. This is because the raw values of `days_since_2011` are large, so even a small coefficient translates into a large contribution to the prediction. A better way to look at it is with an effect plot, which shows the raw data multiplied by the coefficient. To run this, I need to pass in a dictionary grouping any one-hot encoded features.
```python
# group the one-hot encoded columns under their original features
cats = {'weather': ['misty', 'rain_snow_storm'], 'season': ['summer', 'fall', 'winter']}
ins.effect_plot(df, cats);
```
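The idea behind an effect plot is simple: each feature's effect on a given row is that row's raw value times the learned coefficient. Here is a minimal sketch built from the model directly; for brevity it leaves the one-hot columns separate rather than combining them with the `cats` grouping, which `ezinterpret` presumably handles.

```python
# Sketch: per-row effect = raw feature value * fitted coefficient
# (one-hot groups from `cats` are kept as separate columns here).
effects = pd.DataFrame({name: df[name] * coef
                        for name, coef in model.params.drop('Intercept').items()})
sns.boxplot(data=effects, orient='h')
plt.xlabel('Effect on predicted rental count')
plt.tight_layout()
plt.show()
```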
Now we see the large range of effects that `days_since_2011` and temperature have on the number of bikes rented, which makes it clear why these are the two most important features.
Finally, I am going to overlay a local prediction on the effect plot.
```python
# use a single observation as the local case
test_case = df.loc[302].to_dict()
ins.effect_plot_with_local_pred(df, cats, test_case, 'cnt');
```
Here our model predicts 3,501 bikes rented when there were actually 3,331. The red x's mark where the values for this specific case lie relative to the rest of the data.
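To sanity-check those numbers without `ezinterpret`, the local prediction can be pulled straight from the fitted model:

```python
# Sketch: reproduce the local prediction for row 302 from the model itself.
case = df.loc[[302]]                # keep it as a one-row DataFrame for predict()
print(model.predict(case).iloc[0])  # predicted count (~3,501 per the text above)
print(df.loc[302, 'cnt'])           # actual count (3,331 per the text above)
```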