Linear Regression in Sklearn, Statsmodels API, & Statsmodels Formula

I will show how to fit a linear regression in sklearn, in the statsmodels formula API, and in the statsmodels API. First I will use sklearn's make_regression to build a synthetic regression dataset with two features and Gaussian noise.

from sklearn.datasets import make_regression
import pandas as pd
X, y = make_regression(n_features = 2, noise=10, random_state=11)
df = pd.DataFrame(X, columns=['X1', 'X2'])
df['Y'] = y
df.head()
         X1        X2           Y
0  0.217348  0.117820   35.528041
1 -0.529372  1.561704   53.670233
2 -0.886240 -0.475733  -93.494490
3 -0.713560 -1.908290 -142.676470
4 -1.297423 -0.714962 -107.748177
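
Since the data is synthetic, make_regression can also hand back the ground-truth coefficients it generated the targets from if you pass coef=True; the fitted coefficients below should land close to these:

_, _, true_coef = make_regression(n_features=2, noise=10, random_state=11, coef=True)
true_coef  # the true weights on X1 and X2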

Sklearn

from sklearn.linear_model import LinearRegression
import numpy as np

lr = LinearRegression()
lr.fit(df[['X1', 'X2']], df['Y'])

Regression coefficients

lr.coef_
array([60.05070199, 59.28817607])

Y intercept

lr.intercept_
-0.4812452912200803

Prediction for X1 = 0.5 and X2 = 0.5

lr.predict(np.array([.5, .5]).reshape(1, -1))  # reshape makes it a single-row 2D array
array([59.18819374])
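
As a sanity check, this prediction is just the intercept plus the dot product of the coefficients with the new feature values:

lr.intercept_ + lr.coef_ @ np.array([.5, .5])  # matches the 59.18819374 above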

R^2

lr.score(df[['X1', 'X2']], df['Y'])
0.9846544787076148
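
For a regressor, score is just R^2, so the same number comes out of sklearn.metrics.r2_score:

from sklearn.metrics import r2_score
r2_score(df['Y'], lr.predict(df[['X1', 'X2']]))
0.9846544787076148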

Adjusted R^2

R2 = lr.score(df[['X1', 'X2']], df['Y'])
n = len(df)  # number of observations
p = 2        # number of predictors

# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
1 - (1 - R2) * (n - 1) / (n - p - 1)
0.9843380762067409
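
Sklearn has no built-in adjusted R^2, so to avoid hard-coding n and p you can wrap the formula in a small helper (my own convenience function, not a library one):

def adjusted_r2(r2, n, p):
    # n = number of observations, p = number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adjusted_r2(R2, len(df), df[['X1', 'X2']].shape[1])  # same value as above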

Mean squared error

from sklearn.metrics import mean_squared_error
mean_squared_error(df['Y'], lr.predict(df[['X1', 'X2']]))
95.90101789061725
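
mean_squared_error is just the mean of the squared residuals, so the same value falls out of numpy directly:

residuals = df['Y'] - lr.predict(df[['X1', 'X2']])
np.mean(residuals ** 2)
95.90101789061725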

Statsmodels formula

from statsmodels.formula.api import ols
formula = 'Y ~ X1 + X2'
model = ols(formula=formula, data=df).fit()
model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     3112.
Date:                Sun, 26 Jul 2020   Prob (F-statistic):           1.05e-88
Time:                        13:12:37   Log-Likelihood:                -370.06
No. Observations:                 100   AIC:                             746.1
Df Residuals:                      97   BIC:                             753.9
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.4812      0.994     -0.484      0.630      -2.455       1.492
X1            60.0507      1.052     57.082      0.000      57.963      62.139
X2            59.2882      1.059     56.000      0.000      57.187      61.389
==============================================================================
Omnibus:                        0.469   Durbin-Watson:                   1.874
Prob(Omnibus):                  0.791   Jarque-Bera (JB):                0.619
Skew:                          -0.128   Prob(JB):                        0.734
Kurtosis:                       2.711   Cond. No.                         1.08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coefficients, intercept, R^2, and adjusted R^2 are all in the summary, so there is no need to compute them by hand.
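
Rather than reading numbers off the summary table, you can also pull them from the fitted results object; params, rsquared, and rsquared_adj are attributes of the statsmodels results:

model.params        # Intercept, X1, and X2 as a labeled Series
model.rsquared      # 0.985, matching the summary
model.rsquared_adj  # 0.984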

Prediction for X1 = 0.5 and X2 = 0.5

model.predict(dict(X1 = 0.5, X2 = 0.5))
0    59.188194
dtype: float64
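
The formula interface will also predict several new points at once from a DataFrame with the same column names (the values here are arbitrary ones I picked for illustration):

newdata = pd.DataFrame({'X1': [0.5, -1.0], 'X2': [0.5, 2.0]})
model.predict(newdata)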

Mean squared error

mean_squared_error(df['Y'], model.predict(df))  # the formula model can predict straight from the DataFrame
95.90101789061725

Statsmodels API

import statsmodels.api as sm

X = df[['X1', 'X2']]
Y = df['Y']

# add a column of ones so the model fits an intercept
X = sm.add_constant(X)

model = sm.OLS(Y, X).fit()
model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     3112.
Date:                Sun, 26 Jul 2020   Prob (F-statistic):           1.05e-88
Time:                        13:16:34   Log-Likelihood:                -370.06
No. Observations:                 100   AIC:                             746.1
Df Residuals:                      97   BIC:                             753.9
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4812      0.994     -0.484      0.630      -2.455       1.492
X1            60.0507      1.052     57.082      0.000      57.963      62.139
X2            59.2882      1.059     56.000      0.000      57.187      61.389
==============================================================================
Omnibus:                        0.469   Durbin-Watson:                   1.874
Prob(Omnibus):                  0.791   Jarque-Bera (JB):                0.619
Skew:                          -0.128   Prob(JB):                        0.734
Kurtosis:                       2.711   Cond. No.                         1.08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As before, the coefficients, intercept, R^2, and adjusted R^2 are all in the summary.
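
One difference from the formula version: because the constant came from add_constant rather than a formula, the intercept is labeled const, and model.params is indexed accordingly:

model.params
const   -0.481245
X1      60.050702
X2      59.288176
dtype: float64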

Prediction for X1 = 0.5 and X2 = 0.5
We have to put a 1 in front of the new observation because add_constant gave the design matrix an intercept column.

Xnew = np.column_stack([1, .5, .5])  # [constant, X1, X2]
model.predict(Xnew)
array([59.18819374])
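
Instead of writing the 1 by hand, sm.add_constant can prepend it to new data too. One gotcha: for a single row every column looks constant, so you need has_constant='add' to force the extra column:

Xnew = sm.add_constant(np.array([[.5, .5]]), has_constant='add')
model.predict(Xnew)  # array([59.18819374]), same as above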

Mean squared error

mean_squared_error(df['Y'], model.predict(X))
95.90101789061725
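
All three approaches fit the same least squares model, so their fitted values should agree to floating-point precision. A quick check against the sklearn model (model here is the statsmodels API version fitted last):

np.allclose(lr.predict(df[['X1', 'X2']]), model.predict(X))  # True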