Optimising random forest hyperparameters
Typically, the hyperparameters which will have the most significant impact on the behaviour of a random forest are the following:
The number of decision trees in the forest
The split criterion
The maximum depth of individual trees
The maximum number of leaf nodes
The number of random features per split
The number of samples in the bootstrap dataset
We will look at each of these hyperparameters individually, with examples of how to select them.
Data
To understand how we can optimise the hyperparameters in a random forest model, we will use scikit-learn’s RandomForestClassifier and a subset of the Titanic dataset.
First, we will import the features and labels using Pandas.
import pandas as pd
train_features = pd.read_csv("data/svm-hyperparameters-train-features.csv")
train_label = pd.read_csv("data/svm-hyperparameters-train-label.csv")
Let’s look at a random sample of entries from this dataset, both for features and labels.
train_features.sample(10)
Some of the available features are:
Pclass, ticket class
Sex
Age, age in years
SibSp, number of siblings/spouses aboard
Parch, number of parents/children aboard
Fare, passenger fare
train_label.sample(10)
The outcome label indicates whether a passenger survived the disaster.
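As a quick check of class balance, we can count the outcome values; the sketch below simply takes the first (and presumably only) column of the label DataFrame:
train_label.iloc[:, 0].value_counts()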
As part of the typical initial steps for model training, we will prepare the data by splitting it into a training and testing subset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_features, train_label, test_size=0.33, random_state=23)
Naive model
First we will train a “naive” model, that is, a model using the defaults provided by RandomForestClassifier. These defaults are:
n_estimators=10
criterion='gini'
max_depth=None
min_samples_split=2
min_samples_leaf=1
min_weight_fraction_leaf=0.0
max_features='auto'
max_leaf_nodes=None
min_impurity_decrease=0.0
min_impurity_split=None
bootstrap=True
oob_score=False
n_jobs=1
random_state=None
verbose=0
warm_start=False
class_weight=None
We will instantiate a random forest classifier:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
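Note that these defaults change between scikit-learn releases (for example, the default n_estimators was raised from 10 to 100 in version 0.22), so it is worth confirming them for the installed version:
rf.get_params()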
We then train it on the X_train and y_train subsets using the appropriate fit method. Since fit expects a one-dimensional array of labels, we flatten the label values with ravel():
rf.fit(X_train, y_train.values.ravel())
We can now evaluate the trained naive model’s score.
from sklearn.metrics import precision_score
predicted_labels = rf.predict(X_test)
precision_score(y_test, predicted_labels)
Hyperparameter search
Below is a simple example of a generic hyperparameter search using the GridSearchCV method in scikit-learn. The score used to measure the “best” model is the mean_test_score, but other metrics could be used, such as the out-of-bag (OOB) error.
parameters = {
    "n_estimators": [5, 10, 50, 100, 250],
    "max_depth": [2, 4, 8, 16, 32, None]
}

from sklearn.model_selection import GridSearchCV

rfc = RandomForestClassifier()
cv = GridSearchCV(rfc, parameters, cv=5)
cv.fit(X_train, y_train.values.ravel())
def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    # use a distinct loop variable to avoid shadowing the params list
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean, 3)} + or - {round(std, 3)} for the {param}')
display(cv)
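As mentioned above, the out-of-bag (OOB) error is an alternative to cross-validated scores: each tree is evaluated on the training samples left out of its bootstrap sample. A minimal sketch, separate from the grid search above:
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True)
rf_oob.fit(X_train, y_train.values.ravel())
rf_oob.oob_score_  # accuracy estimated on out-of-bag samples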
Parameters
Number of decision trees
This is specified using the n_estimators hyperparameter when initialising the random forest.
Typically, a higher number of trees will lead to greater accuracy at the expense of model size and training time.
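To make the training-time trade-off concrete, here is a rough timing sketch (absolute timings will, of course, vary by machine):
import time
for n in (10, 500):
    rf_n = RandomForestClassifier(n_estimators=n)
    start = time.perf_counter()
    rf_n.fit(X_train, y_train.values.ravel())
    print(n, 'trees:', round(time.perf_counter() - start, 2), 's')
We now grid-search over a range of tree counts: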
cv = GridSearchCV(rfc, {"n_estimators": [2, 4, 8, 16, 32, 64, 128, 256, 512]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
results = pd.DataFrame({"n_estimators": [param["n_estimators"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results
from plotnine import *
(
ggplot(results) + geom_boxplot(aes(x='factor(n_estimators)', y='mean_score')) +
geom_errorbar(aes(x='factor(n_estimators)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
theme_classic() + xlab('Number of trees') + ylab('Mean score')
)
The split criteria
At each node, each tree in a random forest decides, according to a specific algorithm, which feature and threshold value to split on. The choice of splitting algorithm is therefore crucial for the random forest’s performance.
Since, in this example, we are dealing with a classification problem, the available choices of split algorithm include:
Gini
Entropy
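For a node with class proportions $p_k$, these are computed as the Gini impurity $G = \sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2$ and the entropy $H = -\sum_k p_k \log_2 p_k$; both are zero for a pure node and largest when the classes are evenly mixed.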
If we were dealing with a random forest for regression, other methods (such as MSE) would be possible choices. We will now compare both split algorithms specified above when training a random forest on our data:
rfc = RandomForestClassifier(n_estimators=256)
cv = GridSearchCV(rfc, {"criterion": ["gini", "entropy"]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
results = pd.DataFrame({"criterion": [param["criterion"] for param in cv.cv_results_['params']], "mean_score": list(cv.cv_results_['mean_test_score']), "std_score": cv.cv_results_['std_test_score']}) results
Maximum depth of individual trees
In theory, the deeper the tree, the more splits it can have and the better it can accommodate the data. However, at the level of an individual tree, this can lead to overfitting. Although this is a problem for decision trees, it is not necessarily a problem for the ensemble, the random forest. The key is to strike a balance between trees that are neither too deep nor too shallow, but there is no universal heuristic to determine the right size. Let’s try a few options for maximum depth:
rfc = RandomForestClassifier(n_estimators=256,
criterion="entropy")
cv = GridSearchCV(rfc,{'max_depth': [2, 4, 8, 16, 32, None]},cv=5)
cv.fit(X_train, y_train.values.ravel())
results = pd.DataFrame({"max_depth": [param["max_depth"] for param in cv.cv_results_['params']],
"mean_score": list(cv.cv_results_['mean_test_score']),
"std_score": cv.cv_results_['std_test_score']})
results = results.dropna()
results
from plotnine import *
(
ggplot(results) + geom_boxplot(aes(x='factor(max_depth)', y='mean_score')) +
theme_classic() + xlab('Max tree depth') + ylab('Mean score')
)
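To see how deep the trees actually grow when left unconstrained (max_depth=None), we can inspect the fitted estimators of the naive model trained earlier:
# depth reached by each tree in the naive forest
sorted(est.get_depth() for est in rf.estimators_)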
Maximum number of leaf nodes
This hyperparameter can be of importance for other topics, such as explainability, since trees with fewer leaf nodes are easier to inspect.
It is specified in scikit-learn using the max_leaf_nodes parameter. Let’s try a few different values:
rfc = RandomForestClassifier(n_estimators=256, criterion="entropy", max_depth=8)
cv = GridSearchCV(rfc, {'max_leaf_nodes': [2**i for i in range(1, 8)]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
results = pd.DataFrame({"max_leaf_nodes": [param["max_leaf_nodes"] for param in cv.cv_results_['params']], "mean_score": list(cv.cv_results_['mean_test_score']), "std_score": cv.cv_results_['std_test_score']}) results = results.dropna() results
from plotnine import *
(
ggplot(results) + geom_boxplot(aes(x='factor(max_leaf_nodes)', y='mean_score')) +
geom_errorbar(aes(x='factor(max_leaf_nodes)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
theme_classic() + xlab('Maximum leaf nodes') + ylab('Mean score')
)
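To illustrate the explainability angle: with a small max_leaf_nodes, an individual tree becomes compact enough to print and read. A sketch using the best forest found by the grid search above:
from sklearn.tree import export_text
best_rf = cv.best_estimator_  # refit by GridSearchCV on the full training split
print(export_text(best_rf.estimators_[0], feature_names=list(X_train.columns)))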
Random features per split
This is an important hyperparameter whose best value depends on how noisy the original data is. Typically, if the data is not very noisy, the number of random features considered per split can be kept low; otherwise, it needs to be kept high.
An important consideration is also the following trade-off:
A low number of random features decreases the forest’s overall variance
A low number of random features increases the bias
A high number of random features increases computational time
In scikit-learn this is specified with the max_features parameter. Assuming $N_f$ is the total number of features, some possible values for this parameter are:
sqrt, which sets max_features to the rounded $\sqrt{N_f}$
log2, which, as above, takes $\log_2(N_f)$
An integer, directly specifying the actual maximum number of features
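For our six features, both shortcuts work out to the same small value (effectively the integer part):
import math
n_features = X_train.shape[1]  # 6 features in our subset
int(math.sqrt(n_features)), int(math.log2(n_features))  # both evaluate to 2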
Let’s try a simple benchmark, even though our data does not have many features to begin with:
rfc = RandomForestClassifier(n_estimators=256, criterion="entropy", max_depth=8)
cv = GridSearchCV(rfc, {'max_features': ["sqrt", "log2", 1, 2, 3, 4, 5, 6]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
results = pd.DataFrame({"max_features": [param["max_features"] for param in cv.cv_results_['params']], "mean_score": list(cv.cv_results_['mean_test_score']), "std_score": cv.cv_results_['std_test_score']}) results
from plotnine import *
(
ggplot(results) + geom_boxplot(aes(x='factor(max_features)', y='mean_score')) +
geom_errorbar(aes(x='factor(max_features)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
theme_classic() + xlab('Maximum number of features') + ylab('Mean score')
)
Bootstrap dataset size
This hyperparameter controls the proportion (or number) of the training data drawn to build each decision tree’s bootstrap dataset.
It is specified in scikit-learn by max_samples and can take the value of either:
None, to take the entirety of the samples
An integer, representing the actual number of samples to take
A float, representing the proportion, between 0 and 1, of the samples to take
Let’s try a hyperparameter search with some values:
rfc = RandomForestClassifier(n_estimators=256, criterion="entropy", max_depth=8, max_features=6)
cv = GridSearchCV(rfc, {'max_samples': [i/10.0 for i in range(1, 10)]}, cv=5)
cv.fit(X_train, y_train.values.ravel())
results = pd.DataFrame({"max_samples": [param["max_samples"] for param in cv.cv_results_['params']], "mean_score": list(cv.cv_results_['mean_test_score']), "std_score": cv.cv_results_['std_test_score']}) results
from plotnine import *
(
ggplot(results) + geom_boxplot(aes(x='factor(max_samples)', y='mean_score')) +
geom_errorbar(aes(x='factor(max_samples)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
theme_classic() + xlab('Proportion of bootstrap samples') + ylab('Mean score')
)
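Putting it all together, we can train a final model with the values selected above and check its precision on the held-out test set. This is only a sketch: the best values depend on your own search results, and the max_samples=0.7 below is an assumed, illustrative pick.
# illustrative final configuration; substitute the values your searches selected
rf_tuned = RandomForestClassifier(n_estimators=256, criterion="entropy",
                                  max_depth=8, max_features=6, max_samples=0.7)
rf_tuned.fit(X_train, y_train.values.ravel())
precision_score(y_test, rf_tuned.predict(X_test))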