Linear Regression in Sklearn, Statsmodels API, & Statsmodels Formula
I will show how to fit a linear regression with sklearn, the Statsmodels API, and the Statsmodels formula interface. First I will use sklearn to make a regression dataset.
from sklearn.datasets import make_regression
import pandas as pd
X, y = make_regression(n_features=2, noise=10, random_state=11)
df = pd.DataFrame(X, columns=['X1', 'X2'])
df['Y'] = y
df.head()
| | X1 | X2 | Y |
|---|---|---|---|
| 0 | 0.217348 | 0.117820 | 35.528041 |
| 1 | -0.529372 | 1.561704 | 53.670233 |
| 2 | -0.886240 | -0.475733 | -93.494490 |
| 3 | -0.713560 | -1.908290 | -142.676470 |
| 4 | -1.297423 | -0.714962 | -107.748177 |
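As an aside, make_regression can also return the ground-truth coefficients it generated y from (via coef=True), which is handy for comparing against the fitted model later; a minimal sketch:
# regenerate the same data, also returning the true coefficients
X_check, y_check, true_coef = make_regression(n_features=2, noise=10, random_state=11, coef=True)
true_coef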
Sklearn
from sklearn.linear_model import LinearRegression
import numpy as np
lr = LinearRegression()
lr.fit(df[['X1', 'X2']], df['Y'])
Regression coefficients
lr.coef_
array([60.05070199, 59.28817607])
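As a sanity check (not part of the original walkthrough), the same estimates can be recovered by solving the least-squares problem directly with NumPy:
# solve min ||A beta - y||^2 with an explicit intercept column
A = np.column_stack([np.ones(len(df)), df[['X1', 'X2']]])
beta, *_ = np.linalg.lstsq(A, df['Y'], rcond=None)
beta  # [intercept, X1 coefficient, X2 coefficient]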
Y Intercept
lr.intercept_
-0.4812452912200803
Prediction for X1 = 0.5 and X2 = 0.5
lr.predict(np.array([.5, .5]).reshape(1, -1))
array([59.18819374])
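The prediction is simply the intercept plus the dot product of the new point with the coefficients:
# intercept + x . coef reproduces lr.predict
lr.intercept_ + np.array([0.5, 0.5]) @ lr.coef_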
R^2
lr.score(df[['X1', 'X2']], df['Y'])
0.9846544787076148
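score computes R^2 as 1 - SS_res / SS_tot, which we can verify by hand:
y_hat = lr.predict(df[['X1', 'X2']])
ss_res = ((df['Y'] - y_hat) ** 2).sum()            # residual sum of squares
ss_tot = ((df['Y'] - df['Y'].mean()) ** 2).sum()   # total sum of squares
1 - ss_res / ss_tot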
Adjusted R^2, which penalizes R^2 for the number of predictors: 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors
R2 = lr.score(df[['X1', 'X2']], df['Y'])
n = len(df)
p = 2
1-(1-R2)*(n-1)/(n-p-1)
0.9843380762067409
Mean squared error
from sklearn.metrics import mean_squared_error
mean_squared_error(df['Y'], lr.predict(df[['X1', 'X2']]))
95.90101789061725
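Taking the square root gives the RMSE, which is on the same scale as Y:
np.sqrt(mean_squared_error(df['Y'], lr.predict(df[['X1', 'X2']])))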
Statsmodels formula
from statsmodels.formula.api import ols
formula = 'Y ~ X1 + X2'
model = ols(formula=formula, data=df).fit()
model.summary()
| Dep. Variable: | Y | R-squared: | 0.985 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.984 |
| Method: | Least Squares | F-statistic: | 3112. |
| Date: | Sun, 26 Jul 2020 | Prob (F-statistic): | 1.05e-88 |
| Time: | 13:12:37 | Log-Likelihood: | -370.06 |
| No. Observations: | 100 | AIC: | 746.1 |
| Df Residuals: | 97 | BIC: | 753.9 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.4812 | 0.994 | -0.484 | 0.630 | -2.455 | 1.492 |
| X1 | 60.0507 | 1.052 | 57.082 | 0.000 | 57.963 | 62.139 |
| X2 | 59.2882 | 1.059 | 56.000 | 0.000 | 57.187 | 61.389 |
| Omnibus: | 0.469 | Durbin-Watson: | 1.874 |
|---|---|---|---|
| Prob(Omnibus): | 0.791 | Jarque-Bera (JB): | 0.619 |
| Skew: | -0.128 | Prob(JB): | 0.734 |
| Kurtosis: | 2.711 | Cond. No. | 1.08 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficients, intercept, R^2, and adjusted R^2 are all in the summary.
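They are also available programmatically as attributes of the fitted results:
model.params        # intercept and coefficients
model.rsquared      # R^2
model.rsquared_adj  # adjusted R^2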
Prediction for X1 = 0.5 and X2 = 0.5
model.predict(dict(X1=0.5, X2=0.5))
0 59.188194
dtype: float64
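As an extra (not shown in the original output), get_prediction attaches uncertainty to the point prediction; summary_frame includes confidence and prediction intervals:
pred = model.get_prediction(dict(X1=0.5, X2=0.5))
pred.summary_frame(alpha=0.05)  # mean, mean_ci_lower/upper, obs_ci_lower/upper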
Mean squared error
mean_squared_error(df['Y'], model.predict(df))
95.90101789061725
Statsmodels API
import statsmodels.api as sm
X = df[['X1', 'X2']]
Y = df['Y']
# add a constant column so the model includes an intercept
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
model.summary()
| Dep. Variable: | Y | R-squared: | 0.985 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.984 |
| Method: | Least Squares | F-statistic: | 3112. |
| Date: | Sun, 26 Jul 2020 | Prob (F-statistic): | 1.05e-88 |
| Time: | 13:16:34 | Log-Likelihood: | -370.06 |
| No. Observations: | 100 | AIC: | 746.1 |
| Df Residuals: | 97 | BIC: | 753.9 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.4812 | 0.994 | -0.484 | 0.630 | -2.455 | 1.492 |
| X1 | 60.0507 | 1.052 | 57.082 | 0.000 | 57.963 | 62.139 |
| X2 | 59.2882 | 1.059 | 56.000 | 0.000 | 57.187 | 61.389 |
| Omnibus: | 0.469 | Durbin-Watson: | 1.874 |
|---|---|---|---|
| Prob(Omnibus): | 0.791 | Jarque-Bera (JB): | 0.619 |
| Skew: | -0.128 | Prob(JB): | 0.734 |
| Kurtosis: | 2.711 | Cond. No. | 1.08 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As before, the coefficients, intercept, R^2, and adjusted R^2 are all in the summary.
Prediction for X1 = 0.5 and X2 = 0.5
We have to prepend a 1 to the new observation for the intercept, matching the constant column added by sm.add_constant
Xnew = np.column_stack([1, .5, .5])
model.predict(Xnew)
array([59.18819374])
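Alternatively, sm.add_constant can build the row for us; has_constant='add' forces the constant column even though a single row looks constant in every column:
Xnew_df = sm.add_constant(pd.DataFrame({'X1': [0.5], 'X2': [0.5]}), has_constant='add')
model.predict(Xnew_df)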
Mean squared error
mean_squared_error(df['Y'], model.predict(X))
95.90101789061725
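As a final check, all three approaches fit the same model, so the sklearn predictions and the statsmodels fitted values agree:
np.allclose(lr.predict(df[['X1', 'X2']]), model.fittedvalues)  # True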