Polynomial Regression and Step Regression

These are notes on using polynomial regression and step functions. These are taken from

Polynomial Regression

Polynomial regression extends linear regression to cases where the relationship between a predictor and the target is non-linear, by adding powers of the predictor as extra features. Below I fit four polynomial regressions of wage on age, with degrees 1 through 4.
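Written out, the degree-d model fitted in this section has the form below. The coefficient names and the additive error term are standard notation that I am assuming here; they do not appear in the original notes.

```latex
\[
\text{wage}_i = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{age}_i^{2} + \cdots + \beta_d\,\text{age}_i^{d} + \varepsilon_i,
\qquad d \in \{1, 2, 3, 4\}
\]
```

The model is still linear in the coefficients, which is why ordinary least-squares machinery can fit it once the powers of age are supplied as columns.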

import pandas as pd 
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv('https://raw.githubusercontent.com/sik-flow/datasets/master/Wage.csv')
df.head()
   year  age  maritl            education        region              jobclass        health          health_ins  logwage   wage
0  2006  18   1. Never Married  1. < HS Grad     2. Middle Atlantic  1. Industrial   1. <=Good       2. No       4.318063  75.043154
1  2004  24   1. Never Married  4. College Grad  2. Middle Atlantic  2. Information  2. >=Very Good  2. No       4.255273  70.476020
2  2003  45   2. Married        3. Some College  2. Middle Atlantic  1. Industrial   1. <=Good       1. Yes      4.875061  130.982177
3  2003  43   2. Married        4. College Grad  2. Middle Atlantic  2. Information  2. >=Very Good  1. Yes      5.041393  154.685293
4  2005  50   4. Divorced       2. HS Grad       2. Middle Atlantic  2. Information  1. <=Good       1. Yes      4.318063  75.043154

Generate polynomial features of degree 1 through 4 from the age column

X1 = PolynomialFeatures(1).fit_transform(df.age.values.reshape(-1,1))
X2 = PolynomialFeatures(2).fit_transform(df.age.values.reshape(-1,1))
X3 = PolynomialFeatures(3).fit_transform(df.age.values.reshape(-1,1))
X4 = PolynomialFeatures(4).fit_transform(df.age.values.reshape(-1,1))
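To make the shape of these matrices concrete, here is a minimal sketch that builds the same kind of design matrix with plain NumPy. The three ages are hypothetical values chosen only for illustration, and np.vander is swapped in for PolynomialFeatures; for a single input column the two should give the same column layout (bias term first, then increasing powers).

```python
import numpy as np

# Hypothetical ages, chosen only for illustration (not rows of the Wage data)
ages = np.array([18, 24, 45])

# Columns are age**0, age**1, age**2: the layout PolynomialFeatures(2)
# produces for a single input column, with the bias column first.
X2_demo = np.vander(ages, N=3, increasing=True)
print(X2_demo)
# [[   1   18  324]
#  [   1   24  576]
#  [   1   45 2025]]
```

Because the bias column is already included, no separate constant needs to be added before fitting.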

Fit a model of wage on each set of polynomial features

fit1 = sm.GLS(df.wage, X1).fit()
fit2 = sm.GLS(df.wage, X2).fit()
fit3 = sm.GLS(df.wage, X3).fit()
fit4 = sm.GLS(df.wage, X4).fit()


# Generate a sequence of age values spanning the range
age_grid = np.arange(df.age.min(), df.age.max()).reshape(-1,1)


# Predict the value of the generated ages
pred1 = fit1.predict(PolynomialFeatures(1).fit_transform(age_grid))
pred2 = fit2.predict(PolynomialFeatures(2).fit_transform(age_grid))
pred3 = fit3.predict(PolynomialFeatures(3).fit_transform(age_grid))
pred4 = fit4.predict(PolynomialFeatures(4).fit_transform(age_grid))
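The four fits can also be compared numerically rather than only by eye. The sketch below is an illustration under stated assumptions: it uses synthetic data (not the Wage data) and np.polyfit in place of the statsmodels fits, and it compares the residual sum of squares (RSS) across degrees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for age and wage (NOT the Wage data): curved trend plus noise
x = rng.uniform(18, 80, size=200)
y = 50 + 4 * x - 0.04 * x**2 + rng.normal(scale=10, size=200)

# Centering x keeps the higher powers numerically well behaved
xc = x - x.mean()

rss = {}
for degree in range(1, 5):
    coefs = np.polyfit(xc, y, deg=degree)      # least-squares polynomial fit
    residuals = y - np.polyval(coefs, xc)
    rss[degree] = float(np.sum(residuals**2))

for degree, value in rss.items():
    print(f"degree {degree}: RSS = {value:.1f}")
```

Training RSS can only stay the same or fall as the degree grows, because each model nests the previous one. A drop in RSS is therefore not enough on its own to justify a higher degree; a held-out set or an F-test between nested models is the usual next step.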

Plot out the model fits

fig, axes = plt.subplots(2, 2, figsize = (12,5))

# One panel per polynomial degree, filled row by row
preds = [pred1, pred2, pred3, pred4]
for degree, (panel, pred) in enumerate(zip(axes.ravel(), preds), start=1):
    panel.scatter(df.age, df.wage, facecolor='None', edgecolor='k', alpha=0.3)
    panel.plot(age_grid, pred, color = 'b')
    panel.set_ylim(ymin=0)
    panel.set_title(f'Poly = {degree}')

fig.subplots_adjust(hspace=.5)

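[Figure: scatter of wage against age with the fitted polynomial curve, one panel each for degrees 1 through 4]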

Step Functions

Step functions split the range of a predictor into bins and fit a separate constant in each bin, which gives a piecewise-constant fit. I am going to put the age column into 4 equal-width bins.
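As a formula, with the first bin acting as the baseline, the step-function model for four bins can be written as below. The indicator notation and coefficient names are my own assumptions; the cut points are the bin edges that pd.cut reports for this data.

```latex
\[
\text{wage}_i = \beta_0 + \beta_1 C_1(\text{age}_i) + \beta_2 C_2(\text{age}_i) + \beta_3 C_3(\text{age}_i) + \varepsilon_i
\]
\[
C_1(x) = I(33.5 < x \le 49), \quad
C_2(x) = I(49 < x \le 64.5), \quad
C_3(x) = I(64.5 < x \le 80)
\]
```

Here the intercept is the mean wage in the baseline bin, and each remaining coefficient is the difference between that bin's mean wage and the baseline.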

df_cut, bins = pd.cut(df.age, 4, retbins = True, right = True)
df_cut.value_counts(sort = False)
(17.938, 33.5]     750
(33.5, 49.0]      1399
(49.0, 64.5]       779
(64.5, 80.0]        72
Name: age, dtype: int64
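The odd-looking lower edge 17.938 comes from how pd.cut builds equal-width bins. As far as I understand its behaviour, it spaces the edges evenly between the minimum and maximum and then lowers the first edge by 0.1% of the range so that the minimum value falls inside the first right-closed interval. The sketch below reproduces the edges with NumPy, assuming ages run from 18 to 80 as the output above implies.

```python
import numpy as np

# Range of df.age implied by the pd.cut output (an assumption, not read from the data here)
age_min, age_max, n_bins = 18, 80, 4

edges = np.linspace(age_min, age_max, n_bins + 1)
edges[0] -= 0.001 * (age_max - age_min)   # widen the first bin so the minimum is included
print(edges)
```

Equal-width bins are not equal-count bins, which is why the last interval holds only 72 observations.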
df_steps = pd.concat([df.age, df_cut, df.wage], keys = ['age','age_cuts','wage'], axis = 1)

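# Create dummy variables for the age groups
# (dtype=int keeps the dummies numeric; recent pandas versions default to booleans)
df_steps_dummies = pd.get_dummies(df_steps['age_cuts'], dtype=int)

# Statsmodels does not add an intercept automatically, so add a constant column
df_steps_dummies = sm.add_constant(df_steps_dummies)

# Drop the (17.938, 33.5] category so it serves as the baseline
# (keeping all four dummies plus the constant would be perfectly collinear)
df_steps_dummies = df_steps_dummies.drop(df_steps_dummies.columns[1], axis = 1)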

df_steps_dummies.head(5)
   const  (33.5, 49.0]  (49.0, 64.5]  (64.5, 80.0]
0  1.0    0             0             0
1  1.0    0             0             0
2  1.0    1             0             0
3  1.0    1             0             0
4  1.0    0             1             0
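# Fit the step-function model (GLM with its default Gaussian family)
fit3 = sm.GLM(df_steps.wage, df_steps_dummies).fit()

# Put the test data in the same bins as the training data.
# right=True matches the right-closed intervals produced by pd.cut above.
bin_mapping = np.digitize(age_grid.ravel(), bins, right=True)

# Get dummies, drop the first dummy category, add a constant
# (dtype=int keeps the dummies numeric; recent pandas versions default to booleans)
X_test2 = sm.add_constant(pd.get_dummies(bin_mapping, dtype=int).drop(1, axis = 1))

# Predict the value of the generated ages using the step-function model
pred2 = fit3.predict(X_test2)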
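One detail worth checking when re-binning new data is which side of each interval is closed. pd.cut was called with right = True, so its intervals include their right edge, whereas np.digitize by default includes the left edge. The sketch below uses the bin edges from the pd.cut output and a few hypothetical ages to show where the two conventions disagree.

```python
import numpy as np

bins = np.array([17.938, 33.5, 49.0, 64.5, 80.0])   # edges shown in the pd.cut output
ages = np.array([18, 49, 50, 80])                   # hypothetical ages, two of them on an edge

default_idx = np.digitize(ages, bins)               # bins[i-1] <= x <  bins[i]
right_idx = np.digitize(ages, bins, right=True)     # bins[i-1] <  x <= bins[i]

print(default_idx)   # [1 3 3 5]
print(right_idx)     # [1 2 3 4]
```

Only ages that sit exactly on an edge are affected: 49 moves between the second and third bins, and 80 would fall outside the four bins under the default convention.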

# Plot
fig, ax = plt.subplots(figsize = (12,5))
fig.suptitle('Piecewise Constant', fontsize = 14)

# Scatter plot with the step-function (piecewise-constant) fit
ax.scatter(df.age, df.wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
ax.plot(age_grid, pred2, c = 'b')

ax.set_xlabel('age')
ax.set_ylabel('wage')
ax.set_ylim(ymin = 0);

[Figure: scatter of wage against age with the piecewise-constant (step function) fit]
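The height of each step is the mean wage of the observations in that bin. The sketch below checks this on a small hypothetical dataset (eight made-up observations, not the Wage data), using np.linalg.lstsq in place of statsmodels: a least-squares fit on an intercept plus dummies recovers the baseline mean and the per-bin differences from it.

```python
import numpy as np

# Hypothetical data: bin index (0 = baseline bin) and response for eight observations
bin_idx = np.array([0, 0, 1, 1, 1, 2, 2, 3])
y = np.array([70.0, 80.0, 110.0, 120.0, 130.0, 115.0, 125.0, 100.0])

# Design matrix: intercept plus one dummy column per non-baseline bin
dummies = [(bin_idx == k).astype(float) for k in (1, 2, 3)]
X = np.column_stack([np.ones_like(y)] + dummies)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

bin_means = np.array([y[bin_idx == k].mean() for k in range(4)])
print(beta)        # intercept, then each bin's offset from the baseline
print(bin_means)   # per-bin means of y
```

Adding the intercept to each dummy coefficient gives back the corresponding bin mean, which is the constant level drawn for that bin in the plot.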