GridSearch vs RandomizedSearch for Hyperparameter Tuning
I’m going to compare hyperparameter tuning with GridSearchCV and RandomizedSearchCV, both in terms of the score each one finds and how long each one takes. First, I’ll create a dataset with 10,000 samples and 20 features.
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
# make dataset
X, y = make_classification(n_samples=10000,
                           n_features=20,
                           n_informative=4,
                           n_redundant=0,
                           random_state=11)
df = pd.DataFrame(X)
df['target'] = y
df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.520216 | -0.299523 | 1.697775 | 0.152835 | -0.071976 | 0.002353 | 0.057001 | 1.656589 | 0.059377 | 0.634026 | ... | 0.230848 | -2.133668 | -0.658056 | 0.227366 | -1.005542 | -0.533868 | -0.656252 | -1.167656 | -0.902226 | 0 |
| 1 | 0.252605 | 1.432791 | 1.561181 | -1.456888 | -0.325153 | -1.757407 | 1.183243 | 0.931166 | 0.967256 | -1.833468 | ... | -1.644497 | 1.259892 | 1.355751 | -1.085283 | -1.347220 | -0.073796 | 0.718362 | -2.334630 | 1.531651 | 0 |
| 2 | -1.118205 | -0.335938 | -0.979303 | 0.188338 | -0.346252 | -1.263341 | -1.037886 | -0.870959 | 2.105311 | 0.892956 | ... | 0.794894 | 0.796176 | 0.193527 | -2.070266 | -1.183444 | -0.231885 | 1.581976 | 1.110054 | 1.610723 | 1 |
| 3 | 0.334311 | 1.568198 | -0.423843 | -0.962124 | 1.060851 | -3.596107 | -0.416077 | -0.602925 | -0.523378 | 0.834385 | ... | -0.636568 | -2.537476 | -0.355572 | 1.032740 | 0.195867 | -0.227352 | -0.332308 | 0.813405 | -1.037039 | 1 |
| 4 | -0.803574 | -0.573973 | 2.605967 | 0.600801 | 0.823409 | 0.494084 | -0.398244 | 1.332191 | 0.273173 | 1.089310 | ... | -1.030162 | -1.252967 | 1.109795 | -1.197247 | -0.681647 | -0.786710 | 0.833898 | -0.258752 | 0.161887 | 0 |
5 rows × 21 columns
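Since I’ll be comparing the two searches on plain accuracy, it’s worth noting that make_classification produces a roughly even class split by default, so accuracy is a reasonable metric here. A quick check, for example:
# check the class balance; the default weights give a roughly 50/50 split,
# so plain accuracy is a sensible score for this comparison
df['target'].value_counts(normalize=True)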
Next, I’m going to build a parameter grid for a random forest classifier. I got the parameter grid from this great article.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Create the random grid
param_grid = {'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'n_estimators': n_estimators}
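Before fitting anything, it helps to know how big this grid is: 2 choices of max_features × 12 of max_depth × 3 of min_samples_split × 10 of n_estimators gives 720 combinations, and with 3-fold cross-validation that means 2,160 model fits. For example:
# sanity check: how many candidate combinations, and how many fits with cv=3
n_candidates = np.prod([len(v) for v in param_grid.values()])
print(n_candidates)      # 720 combinations
print(n_candidates * 3)  # 2160 fits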
Now I will find the best hyperparameters for the random forest using GridSearchCV.
clf = RandomForestClassifier()
rf_gridsearch = GridSearchCV(estimator=clf, param_grid=param_grid,
                             cv=3, verbose=2, n_jobs=-1)
%%time
rf_gridsearch.fit(df.drop('target', axis = 1), df['target'])
Fitting 3 folds for each of 720 candidates, totalling 2160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 146 tasks | elapsed: 8.8min
[Parallel(n_jobs=-1)]: Done 349 tasks | elapsed: 23.0min
[Parallel(n_jobs=-1)]: Done 632 tasks | elapsed: 43.3min
[Parallel(n_jobs=-1)]: Done 997 tasks | elapsed: 69.0min
[Parallel(n_jobs=-1)]: Done 1442 tasks | elapsed: 102.5min
[Parallel(n_jobs=-1)]: Done 1969 tasks | elapsed: 139.3min
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed: 152.9min finished
CPU times: user 14.3 s, sys: 627 ms, total: 14.9 s
Wall time: 2h 32min 59s
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators='warn', n_jobs=None,
oob_score=False,
random_state=None, verbose=0,
warm_start=False),
iid='warn', n_jobs=-1,
param_grid={'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
110, None],
'max_features': ['auto', 'sqrt'],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400,
1600, 1800, 2000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=2)
rf_gridsearch.best_score_
0.9072
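The winning parameter combination isn’t printed above, but it can be pulled out of the fitted search object along with the full per-candidate results, for example:
# best parameter combination found by the grid search
print(rf_gridsearch.best_params_)
# full cross-validation results as a DataFrame, sorted by rank
pd.DataFrame(rf_gridsearch.cv_results_).sort_values('rank_test_score').head()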
Grid search took 2 hours and 32 minutes and reached a best cross-validated accuracy of 90.7%. Now I’ll try RandomizedSearchCV.
rf_random = RandomizedSearchCV(estimator=clf, param_distributions=param_grid, n_iter=100,
                               cv=3, verbose=2, random_state=11, n_jobs=-1)
%%time
rf_random.fit(df.drop('target', axis = 1), df['target'])
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 2.1min
[Parallel(n_jobs=-1)]: Done 146 tasks | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 20.9min finished
CPU times: user 7.08 s, sys: 190 ms, total: 7.27 s
Wall time: 21min 2s
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators='warn',
n_jobs=None,
oob_sc...
verbose=0,
warm_start=False),
iid='warn', n_iter=100, n_jobs=-1,
param_distributions={'max_depth': [10, 20, 30, 40, 50, 60,
70, 80, 90, 100, 110,
None],
'max_features': ['auto', 'sqrt'],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800,
1000, 1200, 1400, 1600,
1800, 2000]},
pre_dispatch='2*n_jobs', random_state=11, refit=True,
return_train_score=False, scoring=None, verbose=2)
rf_random.best_score_
0.907
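As with the grid search, the winning parameters aren’t reproduced here, but the two searches can be compared side by side, for example:
# compare the two searches' best cross-validated scores and parameter choices
print('Grid search:      ', rf_gridsearch.best_score_, rf_gridsearch.best_params_)
print('Randomized search:', rf_random.best_score_, rf_random.best_params_)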
With randomized search I got the same 90.7% cross-validated accuracy, and the search took only 21 minutes instead of over two and a half hours. Since it evaluated just 100 of the 720 candidate combinations, randomized search reached essentially the same result in a fraction of the training time.
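One thing I didn’t take advantage of here: RandomizedSearchCV also accepts continuous distributions in param_distributions, so the search isn’t limited to the fixed points of a grid. A sketch of what that could look like (the ranges below are illustrative, not the values tuned above):
from scipy.stats import randint

# sample integer hyperparameters from ranges instead of fixed lists;
# these bounds are illustrative, not the values used in the runs above
param_dist = {'max_features': ['auto', 'sqrt'],
              'max_depth': randint(10, 111),
              'min_samples_split': randint(2, 11),
              'n_estimators': randint(200, 2001)}
rf_random_dist = RandomizedSearchCV(estimator=clf, param_distributions=param_dist,
                                    n_iter=100, cv=3, random_state=11, n_jobs=-1)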