Linear Regression in Sklearn, Statsmodels API, & Statsmodels Formula
I will show how to fit a linear regression with sklearn, the Statsmodels API, and the Statsmodels formula interface. First I will use sklearn to make a regression dataset.
from sklearn.datasets import make_regression
import pandas as pd
X, y = make_regression(n_features=2, noise=10, random_state=11)
df = pd.DataFrame(X, columns=['X1', 'X2'])
df['Y'] = y
df.head()
| | X1 | X2 | Y |
|---|---|---|---|
| 0 | 0.217348 | 0.117820 | 35.528041 |
| 1 | -0.529372 | 1.561704 | 53.670233 |
| 2 | -0.886240 | -0.475733 | -93.494490 |
| 3 | -0.713560 | -1.908290 | -142.676470 |
| 4 | -1.297423 | -0.714962 | -107.748177 |
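As an aside, make_regression can also return the ground-truth coefficients it generated y from (via coef=True), which is handy for comparing against the fitted model later; a minimal sketch:
# regenerate the same data, also returning the true coefficients
X_check, y_check, true_coef = make_regression(n_features=2, noise=10, random_state=11, coef=True)
true_coef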
Sklearn
from sklearn.linear_model import LinearRegression
import numpy as np
lr = LinearRegression()
lr.fit(df[['X1', 'X2']], df['Y'])
Regression coefficients
lr.coef_
array([60.05070199, 59.28817607])
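As a sanity check (not part of the original walkthrough), the same estimates can be recovered by solving the least-squares problem directly with NumPy:
# solve min ||A beta - y||^2 with an explicit intercept column
A = np.column_stack([np.ones(len(df)), df[['X1', 'X2']]])
beta, *_ = np.linalg.lstsq(A, df['Y'], rcond=None)
beta  # [intercept, X1 coefficient, X2 coefficient]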
Y Intercept
lr.intercept_
-0.4812452912200803
Prediction for X1 = 0.5 and X2 = 0.5
lr.predict(np.array([.5, .5]).reshape(1, -1))
array([59.18819374])
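The prediction is simply the intercept plus the dot product of the new point with the coefficients:
# intercept + x . coef reproduces lr.predict
lr.intercept_ + np.array([0.5, 0.5]) @ lr.coef_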
R^2
lr.score(df[['X1', 'X2']], df['Y'])
0.9846544787076148
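score computes R^2 as 1 - SS_res / SS_tot, which we can verify by hand:
y_hat = lr.predict(df[['X1', 'X2']])
ss_res = ((df['Y'] - y_hat) ** 2).sum()            # residual sum of squares
ss_tot = ((df['Y'] - df['Y'].mean()) ** 2).sum()   # total sum of squares
1 - ss_res / ss_tot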
Adjusted R^2, which penalizes R^2 for the number of predictors: 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors
R2 = lr.score(df[['X1', 'X2']], df['Y'])
n = len(df)
p = 2
1-(1-R2)*(n-1)/(n-p-1)
0.9843380762067409
Mean squared error
from sklearn.metrics import mean_squared_error
mean_squared_error(df['Y'], lr.predict(df[['X1', 'X2']]))
95.90101789061725
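Taking the square root gives the RMSE, which is on the same scale as Y:
np.sqrt(mean_squared_error(df['Y'], lr.predict(df[['X1', 'X2']])))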
Statsmodels formula
from statsmodels.formula.api import ols
formula = 'Y ~ X1 + X2'
model = ols(formula=formula, data=df).fit()
model.summary()
| Dep. Variable: | Y | R-squared: | 0.985 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.984 |
| Method: | Least Squares | F-statistic: | 3112. |
| Date: | Sun, 26 Jul 2020 | Prob (F-statistic): | 1.05e-88 |
| Time: | 13:12:37 | Log-Likelihood: | -370.06 |
| No. Observations: | 100 | AIC: | 746.1 |
| Df Residuals: | 97 | BIC: | 753.9 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.4812 | 0.994 | -0.484 | 0.630 | -2.455 | 1.492 |
| X1 | 60.0507 | 1.052 | 57.082 | 0.000 | 57.963 | 62.139 |
| X2 | 59.2882 | 1.059 | 56.000 | 0.000 | 57.187 | 61.389 |
| Omnibus: | 0.469 | Durbin-Watson: | 1.874 |
|---|---|---|---|
| Prob(Omnibus): | 0.791 | Jarque-Bera (JB): | 0.619 |
| Skew: | -0.128 | Prob(JB): | 0.734 |
| Kurtosis: | 2.711 | Cond. No. | 1.08 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficients, intercept, R^2, and adjusted R^2 are all in the summary.
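They are also available programmatically as attributes of the fitted results:
model.params        # intercept and coefficients
model.rsquared      # R^2
model.rsquared_adj  # adjusted R^2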
Prediction for X1 = 0.5 and X2 = 0.5
model.predict(dict(X1=0.5, X2=0.5))
0 59.188194
dtype: float64
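As an extra (not shown in the original output), get_prediction attaches uncertainty to the point prediction; summary_frame includes confidence and prediction intervals:
pred = model.get_prediction(dict(X1=0.5, X2=0.5))
pred.summary_frame(alpha=0.05)  # mean, mean_ci_lower/upper, obs_ci_lower/upper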
Mean squared error
mean_squared_error(df['Y'], model.predict(df))
95.90101789061725
Statsmodels API
import statsmodels.api as sm
X = df[['X1', 'X2']]
Y = df['Y']
# add a constant column so the model includes an intercept
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
model.summary()
| Dep. Variable: | Y | R-squared: | 0.985 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.984 |
| Method: | Least Squares | F-statistic: | 3112. |
| Date: | Sun, 26 Jul 2020 | Prob (F-statistic): | 1.05e-88 |
| Time: | 13:16:34 | Log-Likelihood: | -370.06 |
| No. Observations: | 100 | AIC: | 746.1 |
| Df Residuals: | 97 | BIC: | 753.9 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.4812 | 0.994 | -0.484 | 0.630 | -2.455 | 1.492 |
| X1 | 60.0507 | 1.052 | 57.082 | 0.000 | 57.963 | 62.139 |
| X2 | 59.2882 | 1.059 | 56.000 | 0.000 | 57.187 | 61.389 |
| Omnibus: | 0.469 | Durbin-Watson: | 1.874 |
|---|---|---|---|
| Prob(Omnibus): | 0.791 | Jarque-Bera (JB): | 0.619 |
| Skew: | -0.128 | Prob(JB): | 0.734 |
| Kurtosis: | 2.711 | Cond. No. | 1.08 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As before, the coefficients, intercept, R^2, and adjusted R^2 are all in the summary.
Prediction for X1 = 0.5 and X2 = 0.5
We have to prepend a 1 to the new observation for the intercept, matching the constant column added by sm.add_constant
Xnew = np.column_stack([1, .5, .5])
model.predict(Xnew)
array([59.18819374])
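Alternatively, sm.add_constant can build the row for us; has_constant='add' forces the constant column even though a single row looks constant in every column:
Xnew_df = sm.add_constant(pd.DataFrame({'X1': [0.5], 'X2': [0.5]}), has_constant='add')
model.predict(Xnew_df)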
Mean squared error
mean_squared_error(df['Y'], model.predict(X))
95.90101789061725
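As a final check, all three approaches fit the same model, so the sklearn predictions and the statsmodels fitted values agree:
np.allclose(lr.predict(df[['X1', 'X2']]), model.fittedvalues)  # True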