Finding High Leverage Points with Cook's Distance

High leverage points are observations with an unusual value of X. It is important to identify these points because they can have a large impact on a linear regression model.
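As a quick illustration of what "unusual X" means, leverage can be computed directly: it is the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$. Here is a minimal sketch with a made-up design matrix (not the dataset used below):

import numpy as np

# Leverage is the diagonal of the hat matrix H = X (X'X)^(-1) X'.
# The first column is the intercept, the second is the feature X.
X = np.array([[1.0, 0.1], [1.0, 0.2], [1.0, 0.3], [1.0, 3.0]])
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.diag(H))  # the last point (X = 3.0) has by far the largest leverage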

I am going to begin by creating a dataset.

from sklearn.datasets import make_regression
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
X, y = make_regression(n_features = 1, noise=10, random_state=11)

I am adding a data point that has an unusual value of X.

X = np.append(X, [3])
y = np.append(y, [70])

df = pd.DataFrame(X, columns=['X'])
df['Y'] = y

Next I plot my data, highlighting the unusual point in red.

plt.scatter(df['X'], df['Y'])
plt.scatter([3], [70], color = 'red')
plt.xlabel('X')
plt.ylabel('Y');

[Figure: scatter plot of the data, with the added point highlighted in red]

We see that the red point has the largest value of X and doesn't fit the general trend of the rest of the data. I am going to use Cook's Distance to measure how much influence each point has. For any point with a large influence, I want to be careful before including it in my model.
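For reference, Cook's Distance measures how much the model's fitted values change when a single observation is left out. For observation $i$ it can be written as

$$D_i = \frac{e_i^2}{p\,s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where $e_i$ is the residual, $h_{ii}$ is the leverage, $p$ is the number of model parameters, and $s^2$ is the model's mean squared error. A point needs both a sizable residual and high leverage to get a large $D_i$.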

from yellowbrick.regressor import CooksDistance

visualizer = CooksDistance()
visualizer.fit(df['X'].values.reshape(-1, 1), df['Y'])

[Figure: Cook's Distance plotted for each instance; one point stands far above the rest]

df['Distance'] = visualizer.distance_
df.sort_values('Distance', ascending = False).head()
             X           Y  Distance
100   3.000000   70.000000  3.658567
30   -2.653319 -159.917692  0.307768
85    1.846365   64.508669  0.135441
55    2.156674  123.491855  0.087477
94    1.402771   47.531992  0.048779

We see that point 100 has the largest Cook's Distance (typically I will want to investigate any point with a Cook's Distance greater than 1).
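As a sanity check, the same distances can be computed with statsmodels' influence diagnostics (a sketch, reusing the df built above):

import statsmodels.api as sm

# Cross-check yellowbrick's values against statsmodels
ols = sm.OLS(df['Y'], sm.add_constant(df['X'])).fit()
cooks_d, p_values = ols.get_influence().cooks_distance
print(cooks_d[100])  # should match the Distance column above (~3.66)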

Let's see what happens to our regression when we keep a point that has high leverage. I am going to build two regression models: the first will include the high leverage point and the second will not.

import statsmodels.api as sm

X = df['X']
Y = df['Y']

# add a constant so the model fits an intercept
X = sm.add_constant(X)

model_1 = sm.OLS(Y, X).fit()
model_1.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.926
Method:                 Least Squares   F-statistic:                     1246.
Date:                Tue, 28 Jul 2020   Prob (F-statistic):           6.87e-58
Time:                        14:56:18   Log-Likelihood:                -407.59
No. Observations:                 101   AIC:                             819.2
Df Residuals:                      99   BIC:                             824.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.5736      1.377     -1.143      0.256      -4.306       1.159
X             49.8307      1.412     35.294      0.000      47.029      52.632
==============================================================================
Omnibus:                       62.244   Durbin-Watson:                   1.371
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              403.422
Skew:                          -1.867   Prob(JB):                     2.50e-88
Kurtosis:                      12.051   Cond. No.                         1.05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

from sklearn.metrics import mean_squared_error

mean_squared_error(df['Y'], model_1.predict(X))
187.3842053021652
X = df.drop(100)['X']
Y = df.drop(100)['Y']

# add a constant so the model fits an intercept
X = sm.add_constant(X)

model_2 = sm.OLS(Y, X).fit()
model_2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.952
Model:                            OLS   Adj. R-squared:                  0.951
Method:                 Least Squares   F-statistic:                     1933.
Date:                Tue, 28 Jul 2020   Prob (F-statistic):           2.59e-66
Time:                        14:57:06   Log-Likelihood:                -381.98
No. Observations:                 100   AIC:                             768.0
Df Residuals:                      98   BIC:                             773.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.8252      1.115     -0.740      0.461      -3.037       1.386
X             52.5054      1.194     43.960      0.000      50.135      54.876
==============================================================================
Omnibus:                        7.213   Durbin-Watson:                   1.615
Prob(Omnibus):                  0.027   Jarque-Bera (JB):                9.901
Skew:                          -0.307   Prob(JB):                      0.00708
Kurtosis:                       4.414   Cond. No.                         1.07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

mean_squared_error(df.drop(100)['Y'], model_2.predict(X))
121.70966154143105

We see that the first model has an R^2 of 0.926 while the second model has an R^2 of 0.952. The first model has an MSE of 187.4 and the second model has an MSE of 121.7. Removing the high leverage point improved both the R^2 and the MSE. Now I am going to plot both lines to see the impact.
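The pull of the leverage point also shows up directly in the coefficients. A quick comparison of the two fitted models:

# The leverage point flattens the slope and drags the intercept down
print(model_1.params)  # with the point:    const ~ -1.57, slope ~ 49.83
print(model_2.params)  # without the point: const ~ -0.83, slope ~ 52.51

The roughly 2.7-unit difference in slope is what tilts the two lines in the plot below.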

Xnew = np.array([[1, df['X'].min()], [1, df['X'].max()]])
plt.scatter(df['X'], df['Y'])
plt.scatter([3], [70], color = 'red')
plt.plot([df['X'].min(), df['X'].max()], model_1.predict(Xnew), label = 'With Leverage Point', color = 'red')
plt.plot([df['X'].min(), df['X'].max()], model_2.predict(Xnew), label = 'Without Leverage Point')
plt.legend();

[Figure: scatter plot of the data with both regression lines overlaid]

The red regression line (fit with the leverage point) is pulled down toward that point, and this shifts the predictions for all of the other data points as well.
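To put a number on that pull, compare the two models' predictions at the leverage point's X value of 3 (using the models fit above):

Xpoint = np.array([[1, 3]])  # constant term plus X = 3
print(model_1.predict(Xpoint))  # with the point:    ~ [147.9]
print(model_2.predict(Xpoint))  # without the point: ~ [156.7]

That is a gap of almost 9 units at a single X value, caused by one added observation.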