GridSearch vs RandomizedSearch for Hyperparameter Tuning

I’m going to compare tuning a model’s hyperparameters with GridSearchCV versus RandomizedSearchCV. First, I’ll create a dataset with 10,000 samples and 20 features.

from sklearn.datasets import make_classification
import numpy as np 
import pandas as pd

# make dataset 
X, y = make_classification(n_samples = 10000, 
                           n_features=20, 
                           n_informative=4, 
                           n_redundant=0, 
                           random_state=11)

df = pd.DataFrame(X)
df['target'] = y
df.head()
(df.head() output: the first five rows of the DataFrame, 5 rows × 21 columns, i.e. the 20 feature columns plus the target column.)

Next, I’m going to build a parameter grid for a random forest classifier. I got the parameter grid from this great article.

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Create the random grid
param_grid = {'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'n_estimators': n_estimators}
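Before running the search, it's worth counting how many candidates this grid produces; as a sanity check (not part of the original notebook), the product of the list lengths gives the 720 candidates, and 2,160 total fits with 3-fold CV, that appear in the log below.

```python
import numpy as np

# Re-stating the grid above just to count its candidates.
n_estimators = [int(x) for x in np.linspace(200, 2000, num=10)]      # 10 values
max_features = ['auto', 'sqrt']                                      # 2 values
max_depth = [int(x) for x in np.linspace(10, 110, num=11)] + [None]  # 12 values
min_samples_split = [2, 5, 10]                                       # 3 values

n_candidates = (len(max_features) * len(max_depth)
                * len(min_samples_split) * len(n_estimators))
print(n_candidates)  # 2 * 12 * 3 * 10 = 720; with cv=3 that is 2160 fits
```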

Now I will find the best hyperparameters for the random forest using GridSearchCV.

clf = RandomForestClassifier()
rf_gridsearch = GridSearchCV(estimator=clf, param_grid=param_grid,
                             cv=3, verbose=2, n_jobs=-1)
%%time
rf_gridsearch.fit(df.drop('target', axis = 1), df['target'])
Fitting 3 folds for each of 720 candidates, totalling 2160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 23.0min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed: 43.3min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed: 69.0min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed: 102.5min
[Parallel(n_jobs=-1)]: Done 1969 tasks      | elapsed: 139.3min
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed: 152.9min finished


CPU times: user 14.3 s, sys: 627 ms, total: 14.9 s
Wall time: 2h 32min 59s
GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
                                       110, None],
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400,
                                          1600, 1800, 2000]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
rf_gridsearch.best_score_
0.9072

Grid search took 2 hours and 33 minutes and reached a cross-validated accuracy of 90.7%. Now I’ll try RandomizedSearchCV.
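One thing worth noting (not used in the run below, which reuses the same list-based grid): RandomizedSearchCV also accepts scipy.stats distributions in `param_distributions`, so each of the `n_iter` candidates draws a fresh value from a range instead of being limited to a fixed list. A minimal sketch on a small synthetic dataset, so it runs in seconds:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem, purely for illustration.
X_small, y_small = make_classification(n_samples=500, n_features=10,
                                       random_state=11)

# Distributions instead of fixed lists: each candidate samples new values.
param_dist = {'n_estimators': randint(50, 200),
              'max_depth': randint(3, 20)}

search = RandomizedSearchCV(RandomForestClassifier(random_state=11),
                            param_distributions=param_dist,
                            n_iter=5, cv=3, random_state=11)
search.fit(X_small, y_small)
print(search.best_params_)
```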

rf_random = RandomizedSearchCV(estimator=clf, param_distributions=param_grid,
                               n_iter=100, cv=3, verbose=2,
                               random_state=11, n_jobs=-1)
%%time
rf_random.fit(df.drop('target', axis = 1), df['target'])
Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 20.9min finished


CPU times: user 7.08 s, sys: 190 ms, total: 7.27 s
Wall time: 21min 2s
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None,
                                                    oob_sc...
                                                    verbose=0,
                                                    warm_start=False),
                   iid='warn', n_iter=100, n_jobs=-1,
                   param_distributions={'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   pre_dispatch='2*n_jobs', random_state=11, refit=True,
                   return_train_score=False, scoring=None, verbose=2)
rf_random.best_score_
0.907

Using randomized search I got the same 90.7% accuracy, but the search took only 21 minutes: roughly a 7x speedup over the exhaustive grid search, since it evaluated 100 randomly sampled candidates instead of all 720.
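Either way, because both searches use `refit=True` by default, the fitted search object retrains the winning candidate on the full dataset and can then be used directly as the model. A minimal sketch on a small synthetic dataset (a tiny two-value grid for speed, not the 720-candidate grid above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem, purely for illustration.
X_small, y_small = make_classification(n_samples=500, n_features=10,
                                       random_state=11)

# With refit=True (the default), the best candidate is retrained on the
# full dataset and exposed via best_estimator_.
search = GridSearchCV(RandomForestClassifier(random_state=11),
                      param_grid={'n_estimators': [50, 100]}, cv=3)
search.fit(X_small, y_small)

print(search.best_params_)           # the winning combination
preds = search.predict(X_small[:5])  # delegates to best_estimator_
```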