Simple GAM with PyGAM

This walkthrough shows how to implement a GAM using pyGAM.

Start by loading in a dataset. Here I use the wage dataset that ships with pyGAM.

from pygam.datasets import wage

X, y = wage()
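Before fitting anything, it helps to peek at what wage() returned. A minimal sanity check (reading the columns as a year, an age, and an education code is an assumption based on the values shown later):

import numpy as np

# 3,000 rows and 3 features, per the model summary further down
print(X.shape, y.shape)

# The third column takes only a handful of discrete values,
# which is why it gets a factor term in the model below
print(np.unique(X[:, 2]))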

Fit a GAM: the first two features get spline terms and the third feature gets a factor term, since it is categorical.

from pygam import LinearGAM, s, f

gam = LinearGAM(s(0) + s(1) + f(2)).fit(X, y)
gam.summary()
LinearGAM                                                                                                 
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                     25.1911
Link Function:                     IdentityLink Log Likelihood:                                -24118.6847
Number of Samples:                         3000 AIC:                                            48289.7516
                                                AICc:                                           48290.2307
                                                GCV:                                             1255.6902
                                                Scale:                                           1236.7251
                                                Pseudo R-Squared:                                   0.2955
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [0.6]                20           7.1          5.95e-03     **          
s(1)                              [0.6]                20           14.1         1.11e-16     ***         
f(2)                              [0.6]                5            4.0          1.11e-16     ***         
intercept                                              1            0.0          1.11e-16     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.


UserWarning: KNOWN BUG: p-values computed in this summary are likely much smaller than they should be.

Please do not make inferences based on these values!

Collaborate on a solution, and stay up to date at:
github.com/dswah/pyGAM/issues/163

gam.lam
[[0.6], [0.6], [0.6]]

We see the smoothing parameter lambda defaults to 0.6 for each term. I am going to tune it with a grid search.

import numpy as np

# 5 candidate lambdas per term, log-spaced from 1e-3 to 1e5
lam = np.logspace(-3, 5, 5)
lams = [lam] * 3   # one grid per term -> 5**3 = 125 candidate models

gam.gridsearch(X, y, lam=lams)
gam.summary()
100% (125 of 125) |######################| Elapsed Time: 0:00:03 Time:  0:00:03


LinearGAM                                                                                                 
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                      9.6668
Link Function:                     IdentityLink Log Likelihood:                                 -24119.413
Number of Samples:                         3000 AIC:                                            48260.1595
                                                AICc:                                           48260.2428
                                                GCV:                                             1244.2375
                                                Scale:                                           1237.0229
                                                Pseudo R-Squared:                                   0.2916
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [100000.]            20           2.4          9.04e-03     **          
s(1)                              [1000.]              20           3.3          1.11e-16     ***         
f(2)                              [0.1]                5            3.9          1.11e-16     ***         
intercept                                              1            0.0          1.11e-16     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.


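Once the grid search finishes, the selected smoothing parameters are stored on the fitted model. A quick sketch for confirming what was chosen (treating the exact keys of pyGAM's statistics_ dict as an assumption):

# Lambdas selected by the grid search, one list per term
print(gam.lam)

# Post-fit diagnostics collected by pyGAM (assumed keys: 'GCV', 'AIC')
print(gam.statistics_['GCV'], gam.statistics_['AIC'])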

Now let's look at the partial dependence plots for each feature.

import matplotlib.pyplot as plt

# Plot the partial dependence of each term, with 95% confidence intervals
for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue

    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

    plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(repr(term))
    plt.show()

[Partial dependence plots for s(0), s(1), and f(2), each with 95% confidence bands]

We see that the first feature shows a steady increase, the second increases to a point, levels off, and finally decreases, while the third increases across its levels.
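The default titles are just the term reprs. If you want named panels, a small variation of the loop above works; the names below are an assumption based on the values in the data (a year, an age, and an education code):

# Assumed feature names, used only for plot titles
feature_names = ['year', 'age', 'education']

fig, axes = plt.subplots(1, len(feature_names), figsize=(12, 4))
for i, ax in enumerate(axes):
    # term i uses feature i here because the model was built as s(0) + s(1) + f(2)
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
    ax.plot(XX[:, i], pdep)
    ax.plot(XX[:, i], confi, c='r', ls='--')
    ax.set_title(feature_names[i])
plt.tight_layout()
plt.show()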

Now I am going to compare the GAM's performance against an ordinary linear regression.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=11)

# Re-specify the GAM and tune it on the training data only
gam = LinearGAM(s(0) + s(1) + f(2))

lam = np.logspace(-3, 5, 5)
lams = [lam] * 3

gam.gridsearch(X_train, y_train, lam=lams)
gam.summary()
100% (125 of 125) |######################| Elapsed Time: 0:00:03 Time:  0:00:03


LinearGAM                                                                                                 
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                      9.3578
Link Function:                     IdentityLink Log Likelihood:                                -18073.3788
Number of Samples:                         2250 AIC:                                            36167.4732
                                                AICc:                                           36167.5783
                                                GCV:                                             1237.4156
                                                Scale:                                           1228.1554
                                                Pseudo R-Squared:                                   0.2918
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [100000.]            20           2.3          5.58e-02     .           
s(1)                              [1000.]              20           3.1          1.11e-16     ***         
f(2)                              [0.1]                5            3.9          1.11e-16     ***         
intercept                                              1            0.0          1.11e-16     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.


from sklearn.metrics import mean_squared_error

# Held-out MSE of the tuned GAM
mean_squared_error(y_test, gam.predict(X_test))
1265.8585409629538
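For a rough point of comparison with the R-squared that the OLS fit reports below, the GAM's held-out R-squared can be computed the same way (a small sketch using scikit-learn's r2_score):

from sklearn.metrics import r2_score

# Test-set R^2 of the tuned GAM, alongside the test MSE above
print(r2_score(y_test, gam.predict(X_test)))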
import pandas as pd

# Rebuild the train/test splits as DataFrames so statsmodels' formula API can reference named columns
df_train = pd.DataFrame(X_train, columns=['X1', 'X2', 'X3'])
df_train['Y'] = y_train

df_test = pd.DataFrame(X_test, columns=['X1', 'X2', 'X3'])
df_test['Y'] = y_test

df_train.head()
       X1    X2   X3           Y
0  2009.0  56.0  0.0   79.854900
1  2005.0  45.0  2.0   81.283253
2  2008.0  33.0  1.0  128.680488
3  2006.0  51.0  4.0  153.457515
4  2005.0  29.0  3.0  128.680488
from statsmodels.formula.api import ols

# C(X3) treats the third feature as categorical, mirroring the factor term f(2) in the GAM
formula = 'Y ~ X1 + X2 + C(X3)'
model = ols(formula=formula, data=df_train).fit()
model.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.258
Model:                            OLS   Adj. R-squared:                  0.256
Method:                 Least Squares   F-statistic:                     130.3
Date:                Sat, 08 Aug 2020   Prob (F-statistic):          9.99e-142
Time:                        21:04:36   Log-Likelihood:                 -11242.
No. Observations:                2250   AIC:                         2.250e+04
Df Residuals:                    2243   BIC:                         2.254e+04
Df Model:                           6
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     -1906.3133    745.462     -2.557      0.011   -3368.182    -444.445
C(X3)[T.1.0]     10.0980      2.929      3.447      0.001       4.353      15.842
C(X3)[T.2.0]     21.2259      3.060      6.936      0.000      15.224      27.228
C(X3)[T.3.0]     38.9505      3.055     12.749      0.000      32.959      44.942
C(X3)[T.4.0]     62.7525      3.282     19.118      0.000      56.316      69.189
X1                0.9817      0.372      2.641      0.008       0.253       1.711
X2                0.5522      0.066      8.371      0.000       0.423       0.682
==============================================================================
Omnibus:                      667.758   Durbin-Watson:                   2.000
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3095.965
Skew:                           1.347   Prob(JB):                         0.00
Kurtosis:                       8.076   Cond. No.                     1.98e+06
==============================================================================



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.98e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

# Held-out MSE of the OLS model (predict() pulls the needed columns from df_test by name)
mean_squared_error(df_test['Y'], model.predict(df_test))
1300.8156546398066

We see the GAM has a lower test MSE than the linear regression (about 1266 vs. 1301).
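For completeness, the two held-out errors computed above can be printed side by side:

# Recap of the two test-set errors computed earlier (~1265.9 vs. ~1300.8 in this run)
gam_mse = mean_squared_error(y_test, gam.predict(X_test))
ols_mse = mean_squared_error(df_test['Y'], model.predict(df_test))

print(f"GAM test MSE: {gam_mse:.1f}")
print(f"OLS test MSE: {ols_mse:.1f}")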