Simple GAM with PyGAM

August 8, 2020

Show how to implement a GAM using PyGam

Start with loading in a dataset

from pygam.datasets import wage

X, y = wage()

Fit a GAM - first 2 features will be using a spline term and the 3rd feature will be using a factor term.

from pygam import LinearGAM, s, f

gam = LinearGAM(s(0) + s(1) + f(2)).fit(X, y)

gam.summary()

LinearGAM                                                                                                 
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                     25.1911
Link Function:                     IdentityLink Log Likelihood:                                -24118.6847
Number of Samples:                         3000 AIC:                                            48289.7516
                                                AICc:                                           48290.2307
                                                GCV:                                             1255.6902
                                                Scale:                                           1236.7251
                                                Pseudo R-Squared:                                   0.2955
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [0.6]                20           7.1          5.95e-03     **          
s(1)                              [0.6]                20           14.1         1.11e-16     ***         
f(2)                              [0.6]                5            4.0          1.11e-16     ***         
intercept                                              1            0.0          1.11e-16     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.


/Users/jeffreyherman/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: KNOWN BUG: p-values computed in this summary are likely much smaller than they should be. 
 
Please do not make inferences based on these values! 

Collaborate on a solution, and stay up to date at: 
github.com/dswah/pyGAM/issues/163 

  """Entry point for launching an IPython kernel.

gam.lam

[[0.6], [0.6], [0.6]]

We see the lambda term for each feature is 0.6. I am going to tune this using gridsearch.

import numpy as np

lam = np.logspace(-3, 5, 5)
lams = [lam] * 3

gam.gridsearch(X, y, lam=lams)
gam.summary()

100% (125 of 125) |######################| Elapsed Time: 0:00:03 Time:  0:00:03


LinearGAM                                                                                                 
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                      9.6668
Link Function:                     IdentityLink Log Likelihood:                                 -24119.413
Number of Samples:                         3000 AIC:                                            48260.1595
                                                AICc:                                           48260.2428
                                                GCV:                                             1244.2375
                                                Scale:                                           1237.0229
                                                Pseudo R-Squared:                                   0.2916
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [100000.]            20           2.4          9.04e-03     **          
s(1)                              [1000.]              20           3.3          1.11e-16     ***         
f(2)                              [0.1]                5            3.9          1.11e-16     ***         
intercept                                              1            0.0          1.11e-16     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.


/Users/jeffreyherman/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:7: UserWarning: KNOWN BUG: p-values computed in this summary are likely much smaller than they should be. 
 
Please do not make inferences based on these values! 

Collaborate on a solution, and stay up to date at: 
github.com/dswah/pyGAM/issues/163 

  import sys

Now lets see the partial dependence plots of our features

import matplotlib.pyplot as plt

for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue

    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

    plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(repr(term))
    plt.show()

png

We see the first feature has a steady increase, the second feature increases to a point, then levels off and finally decreases, while the third feature increases.

Now I am going to compare the performance vs linear regression

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=11)

gam = LinearGAM(s(0) + s(1) + f(2)).fit(X, y)

lam = np.logspace(-3, 5, 5)
lams = [lam] * 3

gam.gridsearch(X_train, y_train, lam=lams)
gam.summary()

100% (125 of 125) |######################| Elapsed Time: 0:00:03 Time:  0:00:03


LinearGAM                                                                                                 
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                      9.3578
Link Function:                     IdentityLink Log Likelihood:                                -18073.3788
Number of Samples:                         2250 AIC:                                            36167.4732
                                                AICc:                                           36167.5783
                                                GCV:                                             1237.4156
                                                Scale:                                           1228.1554
                                                Pseudo R-Squared:                                   0.2918
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [100000.]            20           2.3          5.58e-02     .           
s(1)                              [1000.]              20           3.1          1.11e-16     ***         
f(2)                              [0.1]                5            3.9          1.11e-16     ***         
intercept                                              1            0.0          1.11e-16     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.


/Users/jeffreyherman/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:7: UserWarning: KNOWN BUG: p-values computed in this summary are likely much smaller than they should be. 
 
Please do not make inferences based on these values! 

Collaborate on a solution, and stay up to date at: 
github.com/dswah/pyGAM/issues/163 

  import sys

from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, gam.predict(X_test))

1265.8585409629538

import pandas as pd

df_train = pd.DataFrame(X_train)
df_train['Y'] = y_train
df_train.columns = ['X1', 'X2', 'X3', 'Y']

df_test = pd.DataFrame(X_test)
df_test['Y'] = y_test
df_test.columns = ['X1', 'X2', 'X3', 'Y']

df_train.head()

	X1	X2	X3	Y
0	2009.0	56.0	0.0	79.854900
1	2005.0	45.0	2.0	81.283253
2	2008.0	33.0	1.0	128.680488
3	2006.0	51.0	4.0	153.457515
4	2005.0	29.0	3.0	128.680488

from statsmodels.formula.api import ols
formula = 'Y ~ X1 + X2 + C(X3)'
model = ols(formula=formula, data=df_train).fit()
model.summary()

OLS Regression Results
Dep. Variable:	Y	R-squared:	0.258
Model:	OLS	Adj. R-squared:	0.256
Method:	Least Squares	F-statistic:	130.3
Date:	Sat, 08 Aug 2020	Prob (F-statistic):	9.99e-142
Time:	21:04:36	Log-Likelihood:	-11242.
No. Observations:	2250	AIC:	2.250e+04
Df Residuals:	2243	BIC:	2.254e+04
Df Model:	6
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-1906.3133	745.462	-2.557	0.011	-3368.182	-444.445
C(X3)[T.1.0]	10.0980	2.929	3.447	0.001	4.353	15.842
C(X3)[T.2.0]	21.2259	3.060	6.936	0.000	15.224	27.228
C(X3)[T.3.0]	38.9505	3.055	12.749	0.000	32.959	44.942
C(X3)[T.4.0]	62.7525	3.282	19.118	0.000	56.316	69.189
X1	0.9817	0.372	2.641	0.008	0.253	1.711
X2	0.5522	0.066	8.371	0.000	0.423	0.682

Omnibus:	667.758	Durbin-Watson:	2.000
Prob(Omnibus):	0.000	Jarque-Bera (JB):	3095.965
Skew:	1.347	Prob(JB):	0.00
Kurtosis:	8.076	Cond. No.	1.98e+06

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.98e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

mean_squared_error(df_test['Y'], model.predict(dict(X1 = df_test['X1'].values, 
                                                    X2 = df_test['X2'].values,
                                                   X3 = df_test['X3'].values)))

1300.8156546398066

We see the GAM has a lower MSE