# Optimising random forest hyperparameters
Typically, the hyperparameters which will have the most significant impact on the behaviour of a random forest are the following:

- The number of decision trees in a random forest
- The split criteria
- The maximum depth of individual trees
- The maximum number of leaf nodes
- The number of random features per split
- The number of samples in the bootstrap dataset
We will look at each of these hyperparameters individually, with examples of how to select them.

## Data
To understand how we can optimise the hyperparameters in a random forest model, we will use scikit-learn’s `RandomForestClassifier` and a subset of the Titanic dataset.
First, we will import the features and labels using Pandas.
```python
import pandas as pd

train_features = pd.read_csv("data/svm-hyperparameters-train-features.csv")
train_label = pd.read_csv("data/svm-hyperparameters-train-label.csv")
```
Let’s look at a random sample of entries from this dataset, both for features and labels.

```python
train_features.sample(10)
```
Some of the available features are:

- `Pclass`, ticket class
- `Sex`
- `Age`, age in years
- `Sibsp`, number of siblings/spouses aboard
- `Parch`, number of parents/children aboard
- `Fare`, passenger fare
```python
train_label.sample(10)
```
The outcome label indicates whether a passenger survived the disaster.
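To see how balanced the two classes are, one could run a quick check; this is a small sketch, not part of the original text:

```python
# Count survivors (1) vs non-survivors (0) in the label set
print(train_label.value_counts())
```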
As part of the typical initial steps for model training, we will prepare the data by splitting it into a training and testing subset.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_features, train_label, test_size=0.33, random_state=23
)
```
## Naive model
First, we will train a “naive” model, that is, a model using the defaults provided by `RandomForestClassifier`. These defaults are:

- `n_estimators=10`
- `criterion='gini'`
- `max_depth=None`
- `min_samples_split=2`
- `min_samples_leaf=1`
- `min_weight_fraction_leaf=0.0`
- `max_features='auto'`
- `max_leaf_nodes=None`
- `min_impurity_decrease=0.0`
- `min_impurity_split=None`
- `bootstrap=True`
- `oob_score=False`
- `n_jobs=1`
- `random_state=None`
- `verbose=0`
- `warm_start=False`
- `class_weight=None`
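These can also be listed programmatically; as a small sketch (not in the original text), `get_params` returns every hyperparameter and its current value:

```python
from sklearn.ensemble import RandomForestClassifier

# Print the default hyperparameters of an untrained classifier
print(RandomForestClassifier().get_params())
```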
We will instantiate a random forest classifier:
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
```
And train it on the `X_train` and `y_train` subsets using the `fit` method.

```python
true_labels = train_label.values.ravel()

# ravel() flattens the single-column label DataFrame into the 1-D array expected by fit
rf.fit(X_train, y_train.values.ravel())
```
We can now evaluate the trained naive model’s score.

```python
from sklearn.metrics import precision_score

predicted_labels = rf.predict(X_test)
precision_score(y_test, predicted_labels)
```
## Hyperparameter search
Below is a simple example of a generic hyperparameter search using the `GridSearchCV` class in scikit-learn. The score used to select the “best” model is the `mean_test_score`, but other metrics could be used, such as the out-of-bag (OOB) error.
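As an aside, the OOB error can be estimated by the forest itself when `oob_score=True` is passed; here is a minimal sketch, not part of the original search:

```python
# Each tree is evaluated on the training rows left out of its bootstrap sample
oob_rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=23)
oob_rf.fit(X_train, y_train.values.ravel())
print(oob_rf.oob_score_)  # OOB accuracy; the OOB error is 1 - oob_score_
```

The grid search itself follows.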
```python
from sklearn.model_selection import GridSearchCV

parameters = {
    "n_estimators": [5, 10, 50, 100, 250],
    "max_depth": [2, 4, 8, 16, 32, None]
}

rfc = RandomForestClassifier()
cv = GridSearchCV(rfc, parameters, cv=5)
cv.fit(X_train, y_train.values.ravel())
```
```python
def display(results):
    # Summarise a fitted GridSearchCV: best parameters, then mean ± std for each setting
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean, 3)} + or -{round(std, 3)} for the {param}')

display(cv)
```
## Parameters
### Number of decision trees
This is specified using the `n_estimators` hyperparameter at random forest initialisation. Typically, a higher number of trees will lead to greater accuracy, at the expense of model size and training time.
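To see the training-time cost directly, one could time a few fits. This is an illustrative sketch (not from the original text), assuming the training split from earlier:

```python
import time

# Training time grows roughly linearly with the number of trees
for n in [10, 100, 500]:
    model = RandomForestClassifier(n_estimators=n, random_state=23)
    start = time.perf_counter()
    model.fit(X_train, y_train.values.ravel())
    print(f"n_estimators={n}: {time.perf_counter() - start:.2f}s")
```

The grid search below then scores a wider range of tree counts with cross-validation.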
```python
cv = GridSearchCV(rfc, {"n_estimators": [2, 4, 8, 16, 32, 64, 128, 256, 512]}, cv=5)
cv.fit(X_train, y_train.values.ravel())

results = pd.DataFrame({"n_estimators": [param["n_estimators"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results
```
```python
from plotnine import *

(ggplot(results) +
 geom_boxplot(aes(x='factor(n_estimators)', y='mean_score')) +
 geom_errorbar(aes(x='factor(n_estimators)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
 xlab('Number of trees') + ylab('Mean score') +
 theme_classic())
```
### The split criteria

At each node, a random forest decides, according to a specific algorithm, which feature and value to split on. The choice of splitting algorithm is therefore crucial for the random forest’s performance.
Since, in this example, we are dealing with a classification problem, the possible split criteria are:

- Gini
- Entropy
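To make the two criteria concrete, here is a small illustrative sketch (not from the original text) computing both impurity measures for a hypothetical node:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy: -sum of p * log2(p) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node_labels = np.array([0, 0, 0, 1, 1])  # hypothetical node: three "died", two "survived"
print(gini(node_labels))     # 0.48
print(entropy(node_labels))  # ~0.971
```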
If we were dealing with a random forest for regression, other criteria (such as MSE) would be possible choices. We will now compare both split algorithms in training a random forest with our data:
```python
rfc = RandomForestClassifier(n_estimators=256)

cv = GridSearchCV(rfc, {"criterion": ["gini", "entropy"]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
```
```python
results = pd.DataFrame({"criterion": [param["criterion"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results
```
### Maximum depth of individual trees

In theory, the deeper the tree, the more splits it can have, and the better it can accommodate the data. However, at the level of an individual tree this can lead to overfitting. Although this is a problem for decision trees, it is not necessarily a problem for the ensemble, the random forest. The key is to strike a balance between trees that are neither too deep nor too shallow, but there is no universal heuristic to determine the right depth.
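One way to see how deep unconstrained trees actually grow is to inspect the fitted estimators. A small sketch, assuming the naive model `rf` from earlier is still in scope:

```python
# get_depth() reports the depth each unconstrained tree actually reached
depths = [tree.get_depth() for tree in rf.estimators_]
print(min(depths), max(depths))
```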
Let’s try a few options for maximum depth:

```python
rfc = RandomForestClassifier(n_estimators=256,
                             criterion="entropy")

cv = GridSearchCV(rfc, {'max_depth': [2, 4, 8, 16, 32, None]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
```
```python
results = pd.DataFrame({"max_depth": [param["max_depth"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results = results.dropna()
results
```
```python
from plotnine import *

(ggplot(results) +
 geom_boxplot(aes(x='factor(max_depth)', y='mean_score')) +
 xlab('Max tree depth') + ylab('Mean score') +
 theme_classic())
```
### Maximum number of leaf nodes

- This hyperparameter can be of importance to other topics, such as explainability.
- It is specified in scikit-learn using the `max_leaf_nodes` parameter.

Let’s try a few different values:

```python
rfc = RandomForestClassifier(n_estimators=256,
                             criterion="entropy",
                             max_depth=8)

cv = GridSearchCV(rfc, {'max_leaf_nodes': [2**i for i in range(1, 8)]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
```
```python
results = pd.DataFrame({"max_leaf_nodes": [param["max_leaf_nodes"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results = results.dropna()
results
```
```python
from plotnine import *

(ggplot(results) +
 geom_boxplot(aes(x='factor(max_leaf_nodes)', y='mean_score')) +
 geom_errorbar(aes(x='factor(max_leaf_nodes)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
 xlab('Maximum leaf nodes') + ylab('Mean score') +
 theme_classic())
```
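Since fewer leaves also make individual trees easier to read, one way to see the explainability angle mentioned above is to print the rules of a single tree from a heavily constrained forest. A small illustrative sketch (the forest below is hypothetical and not part of the benchmark):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# With only four leaves per tree, each tree reduces to a handful of readable rules
small_rf = RandomForestClassifier(n_estimators=10, max_leaf_nodes=4)
small_rf.fit(X_train, y_train.values.ravel())
print(export_text(small_rf.estimators_[0], feature_names=list(X_train.columns)))
```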
### Random features per split

This is an important hyperparameter whose choice will depend on how noisy the original data is. Typically, if the data is not very noisy, the number of random features used can be kept low; otherwise, it needs to be kept high.

An important consideration is also the following trade-off:

- A low number of random features decreases the forest’s overall variance
- A low number of random features increases the bias
- A high number of random features increases computational time
In scikit-learn this is specified with the `max_features` parameter. Assuming \(N_f\) is the total number of features, some possible values for this parameter are the following (compared numerically in the sketch below):

- `sqrt`, which takes `max_features` as the rounded \(\sqrt{N_f}\)
- `log2`, which, as above, takes the rounded \(\log_2(N_f)\)
- An integer, directly specifying the maximum number of features
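As a quick numerical illustration (assuming the six features listed earlier, and assuming scikit-learn truncates the result to an integer):

```python
import math

N_f = 6  # assumed number of features in our Titanic subset
print(int(math.sqrt(N_f)))  # "sqrt" -> 2 features per split
print(int(math.log2(N_f)))  # "log2" -> 2 features per split
```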
Let’s try a simple benchmark, even though our data does not have many features to begin with:
```python
rfc = RandomForestClassifier(n_estimators=256,
                             criterion="entropy",
                             max_depth=8)

cv = GridSearchCV(rfc, {'max_features': ["sqrt", "log2", 1, 2, 3, 4, 5, 6]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
```
```python
results = pd.DataFrame({"max_features": [param["max_features"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results
```
```python
from plotnine import *

(ggplot(results) +
 geom_boxplot(aes(x='factor(max_features)', y='mean_score')) +
 geom_errorbar(aes(x='factor(max_features)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
 xlab('Maximum number of features') + ylab('Mean score') +
 theme_classic())
```
### Bootstrap dataset size

This hyperparameter controls the proportion of the training data used by each decision tree. It is specified in scikit-learn by `max_samples` and can take one of the following values:

- `None`, to take the entirety of the samples
- An integer, representing the actual number of samples
- A float, representing the proportion, between 0 and 1, of the samples to take
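To make the mechanics concrete, here is an illustrative sketch (not from the original text) of the draw a single tree would make with `max_samples=0.5`:

```python
import numpy as np

rng = np.random.default_rng(23)
n_samples = len(X_train)

# Each tree draws its rows *with replacement* from the training set;
# max_samples=0.5 caps the draw at half the training rows
bootstrap_idx = rng.choice(n_samples, size=n_samples // 2, replace=True)
bootstrap_sample = X_train.iloc[bootstrap_idx]
```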
Let’s try a hyperparameter search with some values:
```python
rfc = RandomForestClassifier(n_estimators=256,
                             criterion="entropy",
                             max_depth=8,
                             max_features=6)

cv = GridSearchCV(rfc, {'max_samples': [i/10.0 for i in range(1, 10)]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
```
```python
results = pd.DataFrame({"max_samples": [param["max_samples"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results
```
```python
from plotnine import *

(ggplot(results) +
 geom_boxplot(aes(x='factor(max_samples)', y='mean_score')) +
 geom_errorbar(aes(x='factor(max_samples)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
 xlab('Proportion of bootstrap samples') + ylab('Mean score') +
 theme_classic())
```