Finding High Leverage Points with Cook's Distance

High leverage points are observations with an unusual value of X. It is important to identify these points because they can have a large impact on a linear regression model.
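As a quick illustration of what "unusual X" means, leverage can be computed directly: it is the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$. Here is a minimal sketch with a made-up design matrix (not the dataset used below):

import numpy as np

# Leverage is the diagonal of the hat matrix H = X (X'X)^(-1) X'.
# The first column is the intercept, the second is the feature X.
X = np.array([[1.0, 0.1], [1.0, 0.2], [1.0, 0.3], [1.0, 3.0]])
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.diag(H))  # the last point (X = 3.0) has by far the largest leverage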

I am going to begin by creating a dataset.

from sklearn.datasets import make_regression
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
X, y = make_regression(n_features = 1, noise=10, random_state=11)

I am adding a data point that has an unusual value of X.

X = np.append(X, [3])
y = np.append(y, [70])

df = pd.DataFrame(X, columns=['X'])
df['Y'] = y

Next I plot my data, highlighting the unusual point in red.

plt.scatter(df['X'], df['Y'])
plt.scatter([3], [70], color = 'red')
plt.xlabel('X')
plt.ylabel('Y');

[Figure: scatter plot of the data, with the added point highlighted in red]

We see that the red point has the largest value of X and doesn't fit the general trend of the rest of the data. I am going to use Cook's Distance to measure how much influence each point has. For any point with a large influence, I want to be careful before including it in my model.
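For reference, Cook's Distance measures how much the model's fitted values change when a single observation is left out. For observation $i$ it can be written as

$$D_i = \frac{e_i^2}{p\,s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where $e_i$ is the residual, $h_{ii}$ is the leverage, $p$ is the number of model parameters, and $s^2$ is the model's mean squared error. A point needs both a sizable residual and high leverage to get a large $D_i$.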

from yellowbrick.regressor import CooksDistance

visualizer = CooksDistance()
visualizer.fit(df['X'].values.reshape(-1, 1), df['Y'])

[Figure: Cook's Distance plotted for each instance; one point stands far above the rest]

df['Distance'] = visualizer.distance_
df.sort_values('Distance', ascending = False).head()
             X           Y  Distance
100   3.000000   70.000000  3.658567
30   -2.653319 -159.917692  0.307768
85    1.846365   64.508669  0.135441
55    2.156674  123.491855  0.087477
94    1.402771   47.531992  0.048779

We see that point 100 has the largest Cook's Distance (typically I will want to investigate any point with a Cook's Distance greater than 1).
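As a sanity check, the same distances can be computed with statsmodels' influence diagnostics (a sketch, reusing the df built above):

import statsmodels.api as sm

# Cross-check yellowbrick's values against statsmodels
ols = sm.OLS(df['Y'], sm.add_constant(df['X'])).fit()
cooks_d, p_values = ols.get_influence().cooks_distance
print(cooks_d[100])  # should match the Distance column above (~3.66)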

Let's see what happens to our regression when we keep a point that has high leverage. I am going to build two regression models: the first will include the high leverage point and the second will not.

import statsmodels.api as sm

X = df['X']
Y = df['Y']

# add a constant so the model fits an intercept
X = sm.add_constant(X)

model_1 = sm.OLS(Y, X).fit()
model_1.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.926
Method:                 Least Squares   F-statistic:                     1246.
Date:                Tue, 28 Jul 2020   Prob (F-statistic):           6.87e-58
Time:                        14:56:18   Log-Likelihood:                -407.59
No. Observations:                 101   AIC:                             819.2
Df Residuals:                      99   BIC:                             824.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.5736      1.377     -1.143      0.256      -4.306       1.159
X             49.8307      1.412     35.294      0.000      47.029      52.632
==============================================================================
Omnibus:                       62.244   Durbin-Watson:                   1.371
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              403.422
Skew:                          -1.867   Prob(JB):                     2.50e-88
Kurtosis:                      12.051   Cond. No.                         1.05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

from sklearn.metrics import mean_squared_error

mean_squared_error(df['Y'], model_1.predict(X))
187.3842053021652
X = df.drop(100)['X']
Y = df.drop(100)['Y']

# add a constant so the model fits an intercept
X = sm.add_constant(X)

model_2 = sm.OLS(Y, X).fit()
model_2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.952
Model:                            OLS   Adj. R-squared:                  0.951
Method:                 Least Squares   F-statistic:                     1933.
Date:                Tue, 28 Jul 2020   Prob (F-statistic):           2.59e-66
Time:                        14:57:06   Log-Likelihood:                -381.98
No. Observations:                 100   AIC:                             768.0
Df Residuals:                      98   BIC:                             773.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.8252      1.115     -0.740      0.461      -3.037       1.386
X             52.5054      1.194     43.960      0.000      50.135      54.876
==============================================================================
Omnibus:                        7.213   Durbin-Watson:                   1.615
Prob(Omnibus):                  0.027   Jarque-Bera (JB):                9.901
Skew:                          -0.307   Prob(JB):                      0.00708
Kurtosis:                       4.414   Cond. No.                         1.07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

mean_squared_error(df.drop(100)['Y'], model_2.predict(X))
121.70966154143105

We see that the first model has an R^2 of 0.926 while the second model has an R^2 of 0.952. The first model has an MSE of 187.4 and the second model has an MSE of 121.7. Removing the high leverage point improved both the R^2 and the MSE. Now I am going to plot both lines to see the impact.
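The pull of the leverage point also shows up directly in the coefficients. A quick comparison of the two fitted models:

# The leverage point flattens the slope and drags the intercept down
print(model_1.params)  # with the point:    const ~ -1.57, slope ~ 49.83
print(model_2.params)  # without the point: const ~ -0.83, slope ~ 52.51

The roughly 2.7-unit difference in slope is what tilts the two lines in the plot below.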

Xnew = np.array([[1, df['X'].min()], [1, df['X'].max()]])
plt.scatter(df['X'], df['Y'])
plt.scatter([3], [70], color = 'red')
plt.plot([df['X'].min(), df['X'].max()], model_1.predict(Xnew), label = 'With Leverage Point', color = 'red')
plt.plot([df['X'].min(), df['X'].max()], model_2.predict(Xnew), label = 'Without Leverage Point')
plt.legend();

[Figure: scatter plot of the data with both regression lines overlaid]

The red regression line (fit with the leverage point) is pulled down toward that point, and this shifts the predictions for all of the other data points as well.
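To put a number on that pull, compare the two models' predictions at the leverage point's X value of 3 (using the models fit above):

Xpoint = np.array([[1, 3]])  # constant term plus X = 3
print(model_1.predict(Xpoint))  # with the point:    ~ [147.9]
print(model_2.predict(Xpoint))  # without the point: ~ [156.7]

That is a gap of almost 9 units at a single X value, caused by one added observation.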