GridSearch vs RandomizedSearch for Hyperparameter Tuning
I’m going to compare hyperparameter tuning with GridSearchCV and RandomizedSearchCV, both in terms of the score each one finds and how long each one takes. First, I’ll create a dataset with 10,000 samples and 20 features.
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
# make dataset
X, y = make_classification(n_samples=10000,
                           n_features=20,
                           n_informative=4,
                           n_redundant=0,
                           random_state=11)
df = pd.DataFrame(X)
df['target'] = y
df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.520216 | -0.299523 | 1.697775 | 0.152835 | -0.071976 | 0.002353 | 0.057001 | 1.656589 | 0.059377 | 0.634026 | ... | 0.230848 | -2.133668 | -0.658056 | 0.227366 | -1.005542 | -0.533868 | -0.656252 | -1.167656 | -0.902226 | 0 |
| 1 | 0.252605 | 1.432791 | 1.561181 | -1.456888 | -0.325153 | -1.757407 | 1.183243 | 0.931166 | 0.967256 | -1.833468 | ... | -1.644497 | 1.259892 | 1.355751 | -1.085283 | -1.347220 | -0.073796 | 0.718362 | -2.334630 | 1.531651 | 0 |
| 2 | -1.118205 | -0.335938 | -0.979303 | 0.188338 | -0.346252 | -1.263341 | -1.037886 | -0.870959 | 2.105311 | 0.892956 | ... | 0.794894 | 0.796176 | 0.193527 | -2.070266 | -1.183444 | -0.231885 | 1.581976 | 1.110054 | 1.610723 | 1 |
| 3 | 0.334311 | 1.568198 | -0.423843 | -0.962124 | 1.060851 | -3.596107 | -0.416077 | -0.602925 | -0.523378 | 0.834385 | ... | -0.636568 | -2.537476 | -0.355572 | 1.032740 | 0.195867 | -0.227352 | -0.332308 | 0.813405 | -1.037039 | 1 |
| 4 | -0.803574 | -0.573973 | 2.605967 | 0.600801 | 0.823409 | 0.494084 | -0.398244 | 1.332191 | 0.273173 | 1.089310 | ... | -1.030162 | -1.252967 | 1.109795 | -1.197247 | -0.681647 | -0.786710 | 0.833898 | -0.258752 | 0.161887 | 0 |
5 rows × 21 columns
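Since I’ll be comparing the two searches on plain accuracy, it’s worth noting that make_classification produces a roughly even class split by default, so accuracy is a reasonable metric here. A quick check, for example:
# check the class balance; the default weights give a roughly 50/50 split,
# so plain accuracy is a sensible score for this comparison
df['target'].value_counts(normalize=True)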
Next, I’m going to build a parameter grid for a random forest classifier. I got the parameter grid from this great article.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Create the random grid
param_grid = {'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'n_estimators': n_estimators}
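Before fitting anything, it helps to know how big this grid is: 2 choices of max_features × 12 of max_depth × 3 of min_samples_split × 10 of n_estimators gives 720 combinations, and with 3-fold cross-validation that means 2,160 model fits. For example:
# sanity check: how many candidate combinations, and how many fits with cv=3
n_candidates = np.prod([len(v) for v in param_grid.values()])
print(n_candidates)      # 720 combinations
print(n_candidates * 3)  # 2160 fits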
Now I will find the best hyperparameters for the random forest using GridSearchCV.
clf = RandomForestClassifier()
rf_gridsearch = GridSearchCV(estimator=clf, param_grid=param_grid,
                             cv=3, verbose=2, n_jobs=-1)
%%time
rf_gridsearch.fit(df.drop('target', axis = 1), df['target'])
Fitting 3 folds for each of 720 candidates, totalling 2160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 146 tasks | elapsed: 8.8min
[Parallel(n_jobs=-1)]: Done 349 tasks | elapsed: 23.0min
[Parallel(n_jobs=-1)]: Done 632 tasks | elapsed: 43.3min
[Parallel(n_jobs=-1)]: Done 997 tasks | elapsed: 69.0min
[Parallel(n_jobs=-1)]: Done 1442 tasks | elapsed: 102.5min
[Parallel(n_jobs=-1)]: Done 1969 tasks | elapsed: 139.3min
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed: 152.9min finished
CPU times: user 14.3 s, sys: 627 ms, total: 14.9 s
Wall time: 2h 32min 59s
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators='warn', n_jobs=None,
oob_score=False,
random_state=None, verbose=0,
warm_start=False),
iid='warn', n_jobs=-1,
param_grid={'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
110, None],
'max_features': ['auto', 'sqrt'],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400,
1600, 1800, 2000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=2)
rf_gridsearch.best_score_
0.9072
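The winning parameter combination isn’t printed above, but it can be pulled out of the fitted search object along with the full per-candidate results, for example:
# best parameter combination found by the grid search
print(rf_gridsearch.best_params_)
# full cross-validation results as a DataFrame, sorted by rank
pd.DataFrame(rf_gridsearch.cv_results_).sort_values('rank_test_score').head()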
Grid search took 2 hours and 32 minutes and reached a best cross-validated accuracy of 90.7%. Now I’ll try RandomizedSearchCV.
rf_random = RandomizedSearchCV(estimator=clf, param_distributions=param_grid, n_iter=100,
                               cv=3, verbose=2, random_state=11, n_jobs=-1)
%%time
rf_random.fit(df.drop('target', axis = 1), df['target'])
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 2.1min
[Parallel(n_jobs=-1)]: Done 146 tasks | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 20.9min finished
CPU times: user 7.08 s, sys: 190 ms, total: 7.27 s
Wall time: 21min 2s
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
estimator=RandomForestClassifier(bootstrap=True,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators='warn',
n_jobs=None,
oob_sc...
verbose=0,
warm_start=False),
iid='warn', n_iter=100, n_jobs=-1,
param_distributions={'max_depth': [10, 20, 30, 40, 50, 60,
70, 80, 90, 100, 110,
None],
'max_features': ['auto', 'sqrt'],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800,
1000, 1200, 1400, 1600,
1800, 2000]},
pre_dispatch='2*n_jobs', random_state=11, refit=True,
return_train_score=False, scoring=None, verbose=2)
rf_random.best_score_
0.907
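As with the grid search, the winning parameters aren’t reproduced here, but the two searches can be compared side by side, for example:
# compare the two searches' best cross-validated scores and parameter choices
print('Grid search:      ', rf_gridsearch.best_score_, rf_gridsearch.best_params_)
print('Randomized search:', rf_random.best_score_, rf_random.best_params_)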
With randomized search I got the same 90.7% cross-validated accuracy, and the search took only 21 minutes instead of over two and a half hours. Since it evaluated just 100 of the 720 candidate combinations, randomized search reached essentially the same result in a fraction of the training time.
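One thing I didn’t take advantage of here: RandomizedSearchCV also accepts continuous distributions in param_distributions, so the search isn’t limited to the fixed points of a grid. A sketch of what that could look like (the ranges below are illustrative, not the values tuned above):
from scipy.stats import randint

# sample integer hyperparameters from ranges instead of fixed lists;
# these bounds are illustrative, not the values used in the runs above
param_dist = {'max_features': ['auto', 'sqrt'],
              'max_depth': randint(10, 111),
              'min_samples_split': randint(2, 11),
              'n_estimators': randint(200, 2001)}
rf_random_dist = RandomizedSearchCV(estimator=clf, param_distributions=param_dist,
                                    n_iter=100, cv=3, random_state=11, n_jobs=-1)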