Testing Multicollinearity with VIF

Multicollinearity occurs when two or more features are highly correlated, so that one feature can be closely predicted from a linear combination of the others. I am going to demonstrate how to detect multicollinearity using the variance inflation factor (VIF).

To start, I am going to use sklearn to make a dataset.

from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
np.random.seed(11)
X, y = make_regression(n_features=10, noise=10, random_state=11)
df = pd.DataFrame(X, columns=['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10'])
df['Y'] = y

I am going to create a new feature that is a combination of the other features.

df['X11'] = df['X1'] - df['X2'] + df['X3'] + np.random.normal(scale = 15)
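Note that np.random.normal(scale=15) returns a single scalar draw, so every row gets the same offset and X11 is an exact linear combination of X1, X2, and X3. As a quick sanity check (a small sketch of my own, using the df built above), the leftover offset should be constant across rows:

offset = df['X11'] - (df['X1'] - df['X2'] + df['X3'])  # should be the same value for every row
print(offset.std())  # effectively 0: X11 is an exact linear combination plus a constant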

Now let's look at the correlation heatmap to see if there are any highly correlated features.

import seaborn as sns 
sns.heatmap(df.corr(), vmin = -1, vmax = 1)
[Figure: heatmap of the pairwise correlation matrix]

sns.heatmap(np.abs(df.corr()) > 0.7)
[Figure: heatmap highlighting feature pairs with absolute correlation above 0.7]

We see that no pair of features has an absolute correlation above 0.70. Next, I am going to use VIF to determine whether I am experiencing multicollinearity.
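If you prefer a list to a heatmap, a short sketch like the one below (my own helper, not part of the original analysis) pulls out any pairs of columns whose absolute correlation exceeds 0.7:

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once, drop the diagonal
high_pairs = upper.stack().sort_values(ascending=False)
print(high_pairs[high_pairs > 0.7])  # any column pairs with |correlation| > 0.7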

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

Note: per this open issue on variance_inflation_factor, you need to add a constant column to your DataFrame.

df = sm.add_constant(df)
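To see why the constant matters, here is a small illustration of my own (not from the linked issue): without an intercept column, statsmodels fits each auxiliary regression through the origin, so the R² behind the VIF is computed differently. With uncentered data the difference can be dramatic; here it may be small, since make_regression produces roughly zero-mean features.

X_with_const = df.drop('Y', axis=1)              # includes the 'const' column added above
X_no_const = X_with_const.drop('const', axis=1)  # same features without an intercept
print(variance_inflation_factor(X_with_const.values, X_with_const.columns.get_loc('X4')))
print(variance_inflation_factor(X_no_const.values, X_no_const.columns.get_loc('X4')))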

I am going to compute the VIF for every feature, skipping the first column (the constant term added above).

X = df.drop("Y", axis=1)  # keep the constant and all 11 features
for i in range(1, 12):
    print(f'{X.columns[i]} {variance_inflation_factor(X.values, i)}')
X1 inf
X2 inf
X3 inf
X4 1.0410536144535685
X5 1.0917593842615227
X6 1.0955635908289258
X7 1.0425296201253647
X8 1.1118123228822059
X9 1.051488347630868
X10 1.1173223003485167
X11 inf

I see that multiple features have a VIF value of inf. Let's see what is causing that.

Below is the formula for how VIF is calculated, where R² comes from regressing the feature of interest on all of the other independent variables.

\[VIF = \frac{1}{1 - R^2}\]
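To make the formula concrete, we can rearrange it and plug in one of the finite values from the output above. X8 had a VIF of about 1.11, which means the other features explain only about 10% of its variance:

\[R^2 = 1 - \frac{1}{VIF} \approx 1 - \frac{1}{1.1118} \approx 0.10\]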

From the output above, X1 has a VIF of inf. To calculate the VIF for X1, I am going to build a regression that uses all of the independent variables except X1 to predict X1.

X = df.drop(['X1', 'Y'], axis=1)  # every independent variable except X1
Y = df['X1']                      # X1 is the target of this auxiliary regression

model = sm.OLS(Y, X).fit()
model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                     X1   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.305e+27
Date:                Sun, 26 Jul 2020   Prob (F-statistic):               0.00
Time:                        16:55:53   Log-Likelihood:                 2874.8
No. Observations:                 100   AIC:                            -5728.
Df Residuals:                      89   BIC:                            -5699.
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -26.2418   2.42e-13  -1.08e+14      0.000     -26.242     -26.242
X2             1.0000   1.26e-14   7.92e+13      0.000       1.000       1.000
X3            -1.0000   1.18e-14  -8.47e+13      0.000      -1.000      -1.000
X4         -3.816e-16   8.28e-15     -0.046      0.963   -1.68e-14    1.61e-14
X5          3.088e-16   9.08e-15      0.034      0.973   -1.77e-14    1.84e-14
X6          2.984e-16   9.72e-15      0.031      0.976    -1.9e-14    1.96e-14
X7         -1.388e-17   8.44e-15     -0.002      0.999   -1.68e-14    1.68e-14
X8          1.596e-16   8.42e-15      0.019      0.985   -1.66e-14    1.69e-14
X9         -6.635e-17   8.85e-15     -0.007      0.994   -1.77e-14    1.75e-14
X10        -2.533e-16   8.44e-15     -0.030      0.976    -1.7e-14    1.65e-14
X11            1.0000    9.2e-15   1.09e+14      0.000       1.000       1.000
==============================================================================
Omnibus:                        0.540   Durbin-Watson:                   0.009
Prob(Omnibus):                  0.763   Jarque-Bera (JB):                0.660
Skew:                           0.157   Prob(JB):                        0.719
Kurtosis:                       2.757   Cond. No.                         761.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

I see that I get a perfect R² of 1.000. Plugging this into the VIF formula:

\[VIF = \frac{1}{1 - 1} = \frac{1}{0} \rightarrow \infty\]

This is where the inf comes from: the other features are able to perfectly predict X1. Now let's see what happens when I remove the feature I created.

df.drop(['Y', 'X11'], axis = 1, inplace = True)
for i in range(1, 11):
    print(f'{df.columns[i]} {variance_inflation_factor(df.values, i)}')
X1 1.1056812354246772
X2 1.0741738574018993
X3 1.0836168289787529
X4 1.0410536144535685
X5 1.0917593842615227
X6 1.0955635908289256
X7 1.0425296201253647
X8 1.1118123228822059
X9 1.051488347630868
X10 1.1173223003485164

Now I see that I get a finite VIF value for each of the 10 features. A VIF below 5 (or, more leniently, below 10) is typically considered acceptable. A VIF of 5 means the other features can predict that feature with an R² of 0.80, and a VIF of 10 corresponds to an R² of 0.90.
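To wrap this check into something reusable, here is a small helper sketch of my own (vif_table is not a statsmodels function) that returns every VIF as a pandas Series and flags anything above a chosen threshold:

def vif_table(frame, threshold=5):
    # frame should already contain the 'const' column from sm.add_constant
    cols = [c for c in frame.columns if c != 'const']
    vifs = pd.Series({c: variance_inflation_factor(frame.values, frame.columns.get_loc(c)) for c in cols})
    return vifs[vifs > threshold]  # an empty Series means no feature exceeds the threshold

print(vif_table(df))  # empty here: after dropping X11, every VIF is close to 1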