Optimising random forest hyperparameters

Typically, the hyperparameters with the most significant impact on the behaviour of a random forest are the ones explored below: the number of trees, the split criterion, the maximum tree depth, the maximum number of leaf nodes, the number of random features per split, and the bootstrap dataset size.

Data

To understand how we can optimise the hyperparameters in a random forest model, we will use scikit-learn’s RandomForestClassifier and a subset of the Titanic dataset.

First, we will import the features and labels using Pandas.

import pandas as pd

# Load the pre-processed features and labels
train_features = pd.read_csv("data/svm-hyperparameters-train-features.csv")
train_label = pd.read_csv("data/svm-hyperparameters-train-label.csv")

Let’s look at a random sample of entries from this dataset, both for features and labels.

train_features.sample(10)

Some of the available features are:

  • Pclass, ticket class

  • Sex

  • Age, age in years

  • SibSp, number of siblings/spouses aboard

  • Parch, number of parents/children aboard

  • Fare, passenger fare

train_label.sample(10)

The outcome label indicates whether a passenger survived the disaster.

As part of the typical initial steps for model training, we will prepare the data by splitting it into training and testing subsets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_features, train_label, test_size=0.33, random_state=23)
    

Naive model

First, we will train a “naive” model, that is, a model using the defaults provided by RandomForestClassifier. These defaults are:

  • n_estimators=100

  • criterion='gini'

  • max_depth=None

  • min_samples_split=2

  • min_samples_leaf=1

  • min_weight_fraction_leaf=0.0

  • max_features='auto'

  • max_leaf_nodes=None

  • min_impurity_decrease=0.0

  • min_impurity_split=None

  • bootstrap=True

  • oob_score=False

  • n_jobs=None

  • random_state=None

  • verbose=0

  • warm_start=False

  • class_weight=None

  • ccp_alpha=0.0

  • max_samples=None

We will instantiate a random forest classifier:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
    

And train it on the X_train and y_train subsets using the fit method.

# ravel() flattens the single-column label frame into the 1-D array fit expects
rf.fit(X_train, y_train.values.ravel())
    

We can now evaluate the trained naive model, using the precision score.

from sklearn.metrics import precision_score

predicted_labels = rf.predict(X_test)

precision_score(y_test, predicted_labels)
    

Below is a simple example of a generic hyperparameter search using scikit-learn’s GridSearchCV. The score used to select the “best” model is the mean_test_score, but other metrics could be used, such as the out-of-bag (OOB) error.

from sklearn.model_selection import GridSearchCV

parameters = {
    "n_estimators": [5, 10, 50, 100, 250],
    "max_depth": [2, 4, 8, 16, 32, None]
}

rfc = RandomForestClassifier()
cv = GridSearchCV(rfc, parameters, cv=5)
cv.fit(X_train, y_train.values.ravel())

def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean, 3)} +/- {round(std, 3)} for {param}')

display(cv)
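
As an alternative to cross-validated scores, the OOB estimate mentioned above can be obtained directly from the forest. A minimal sketch (the variable name rf_oob is ours): setting oob_score=True scores each training sample using only the trees that did not see it in their bootstrap sample.

# Minimal sketch: use the out-of-bag (OOB) score as an alternative metric
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True)
rf_oob.fit(X_train, y_train.values.ravel())
print(rf_oob.oob_score_)  # mean accuracy on the out-of-bag samples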

Parameters

Number of decision trees

This is specified using the n_estimators hyperparameter when initialising the random forest.

Typically, a higher number of trees will lead to greater accuracy at the expense of model size and training time.

cv = GridSearchCV(rfc, {"n_estimators": [2, 4, 8, 16, 32, 64, 128, 256, 512]}, cv=5)
cv.fit(X_train, y_train.values.ravel())

results = pd.DataFrame({"n_estimators": [param["n_estimators"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results
from plotnine import *

(
   ggplot(results) + geom_boxplot(aes(x='factor(n_estimators)', y='mean_score')) +
  geom_errorbar(aes(x='factor(n_estimators)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
  theme_classic() + xlab('Number of trees') + ylab('Mean score')
)
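
To make the training-time trade-off concrete, here is a rough sketch comparing fit times for a few forest sizes (the exact timings will of course vary by machine):

import time

for n in [8, 64, 512]:
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=n).fit(X_train, y_train.values.ravel())
    print(f"n_estimators={n}: fit took {time.perf_counter() - start:.2f}s")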

The split criterion

At each node, each tree in a random forest decides, according to a specific algorithm, which feature and value to split on. The choice of splitting algorithm is therefore crucial for the random forest’s performance.

Since, in this example, we are dealing with a classification problem, the possible split criteria are:

  • Gini

  • Entropy
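
Both criteria measure node impurity. For a node with class proportions $p_k$, the Gini impurity is $\sum_k p_k (1 - p_k)$ and the entropy is $-\sum_k p_k \log_2 p_k$. A small sketch of both measures, using illustrative proportions rather than values from our data:

import numpy as np

p = np.array([0.7, 0.3])           # illustrative class proportions at a node
gini = np.sum(p * (1 - p))         # Gini impurity: sum of p_k * (1 - p_k)
entropy = -np.sum(p * np.log2(p))  # entropy: -sum of p_k * log2(p_k)
print(gini, entropy)               # 0.42 0.881...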

If we were dealing with a random forest for regression, other methods (such as MSE) would be possible choices. We will now compare both split algorithms by training a random forest on our data:

rfc = RandomForestClassifier(n_estimators=256)

cv = GridSearchCV(rfc, {"criterion": ["gini", "entropy"]}, cv=5)
cv.fit(X_train, y_train.values.ravel())

results = pd.DataFrame({"criterion": [param["criterion"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results
    

Maximum depth of individual trees

In theory, the deeper the tree, the more splits it can make, and the better it can accommodate the data. However, at the level of an individual tree this can lead to overfitting. Although this is a problem for decision trees, it is not necessarily a problem for the ensemble, the random forest. The key is to strike a balance between trees that are neither too deep nor too shallow, but there is no universal heuristic for determining the right depth. Let’s try a few options for maximum depth:

rfc = RandomForestClassifier(n_estimators=256,
                             criterion="entropy")

cv = GridSearchCV(rfc, {'max_depth': [2, 4, 8, 16, 32, None]}, cv=5)
cv.fit(X_train, y_train.values.ravel())

results = pd.DataFrame({"max_depth": [param["max_depth"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
# drop the max_depth=None row so the remaining depths can be plotted
results = results.dropna()
results
from plotnine import *

(
   ggplot(results) + geom_boxplot(aes(x='factor(max_depth)', y='mean_score')) +
  theme_classic() + xlab('Max tree depth') + ylab('Mean score')
)
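
As a sanity check, the actual depth each tree grew to can be inspected through the forest’s estimators_ attribute; a small sketch using the naive model rf fitted earlier:

# Inspect the actual depth of each fitted tree
depths = [tree.get_depth() for tree in rf.estimators_]
print(min(depths), max(depths))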
Maximum number of leaf nodes

This hyperparameter limits how many leaf nodes each individual tree can grow. Beyond controlling overfitting, it can matter for other topics, such as explainability, since smaller trees are easier to inspect.

It is specified in scikit-learn using the max_leaf_nodes parameter. Let’s try a few different values:

rfc = RandomForestClassifier(n_estimators=256,
                             criterion="entropy",
                             max_depth=8)

cv = GridSearchCV(rfc, {'max_leaf_nodes': [2**i for i in range(1, 8)]}, cv=5)
cv.fit(X_train, y_train.values.ravel())

results = pd.DataFrame({"max_leaf_nodes": [param["max_leaf_nodes"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results = results.dropna()
results

from plotnine import *

(
   ggplot(results) + geom_boxplot(aes(x='factor(max_leaf_nodes)', y='mean_score')) +
  geom_errorbar(aes(x='factor(max_leaf_nodes)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
  theme_classic() + xlab('Maximum leaf nodes') + ylab('Mean score')
)
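
Similarly, we can count the leaves of each fitted tree to see how close they come to the max_leaf_nodes ceiling. A short sketch, where best_rf is simply our name for the model GridSearchCV refitted with the best parameters:

best_rf = cv.best_estimator_  # refitted with the best parameter combination
print([tree.get_n_leaves() for tree in best_rf.estimators_])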
      

Random features per split

This is an important hyperparameter whose ideal value depends on how noisy the original data is. Typically, if the data is not very noisy, the number of random features per split can be kept low; noisier data calls for a higher number.

An important consideration is also the following trade-off:

  • A low number of random features decreases the forest’s overall variance

  • A low number of random features increases the bias

  • A high number of random features increases computational time

In scikit-learn this is specified with the max_features parameter. Assuming $N_f$ is the total number of features, some possible values for this parameter are:

  • sqrt, which sets max_features to the (truncated) $\sqrt{N_f}$

  • log2, which, as above, uses $\log_2(N_f)$

  • An integer, directly specifying the maximum number of features
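
As a quick check (separate from the benchmark below), here is what the two shortcuts resolve to for our feature count; note that scikit-learn truncates rather than rounds:

import math

n_features = X_train.shape[1]
print(max(1, int(math.sqrt(n_features))))  # "sqrt": 2 for six features
print(max(1, int(math.log2(n_features))))  # "log2": 2 for six features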

Let’s try a simple benchmark, even though our data does not have many features to begin with:

rfc = RandomForestClassifier(n_estimators=256,
                             criterion="entropy",
                             max_depth=8)

cv = GridSearchCV(rfc, {'max_features': ["sqrt", "log2", 1, 2, 3, 4, 5, 6]}, cv=5)
cv.fit(X_train, y_train.values.ravel())

results = pd.DataFrame({"max_features": [param["max_features"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results

from plotnine import *

(
   ggplot(results) + geom_boxplot(aes(x='factor(max_features)', y='mean_score')) +
  geom_errorbar(aes(x='factor(max_features)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
  theme_classic() + xlab('Maximum number of features') + ylab('Mean score')
)
    

Bootstrap dataset size

This hyperparameter controls the proportion of the training data used to build each decision tree (the size of the bootstrap sample).

It is specified in scikit-learn by max_samples and can take one of the following values:

  • None, to take all of the samples

  • An integer, representing the actual number of samples

  • A float, representing a proportion (between 0 and 1) of the samples to take
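
To illustrate how each option translates into an actual bootstrap size, a small sketch that roughly mirrors scikit-learn’s own conversion:

n = X_train.shape[0]  # number of training rows
for max_samples in [None, 100, 0.5]:
    if max_samples is None:
        size = n                       # use every sample
    elif isinstance(max_samples, int):
        size = max_samples             # exact sample count
    else:
        size = round(n * max_samples)  # proportion of the data
    print(max_samples, size)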

Let’s try a hyperparameter search with some values:

rfc = RandomForestClassifier(n_estimators=256,
                             criterion="entropy",
                             max_depth=8,
                             max_features=6)

cv = GridSearchCV(rfc, {'max_samples': [i/10.0 for i in range(1, 10)]}, cv=5)
cv.fit(X_train, y_train.values.ravel())

results = pd.DataFrame({"max_samples": [param["max_samples"] for param in cv.cv_results_['params']],
                        "mean_score": list(cv.cv_results_['mean_test_score']),
                        "std_score": cv.cv_results_['std_test_score']})
results

from plotnine import *

(
   ggplot(results) + geom_boxplot(aes(x='factor(max_samples)', y='mean_score')) +
  geom_errorbar(aes(x='factor(max_samples)', ymin='mean_score - std_score', ymax='mean_score + std_score')) +
  theme_classic() + xlab('Proportion of bootstrap samples') + ylab('Mean score')
)
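
Finally, the tuned values can be combined into a single model and evaluated against the held-out test set. The max_samples value below is a placeholder; substitute the best value reported by your own search:

final_rf = RandomForestClassifier(n_estimators=256,
                                  criterion="entropy",
                                  max_depth=8,
                                  max_features=6,
                                  max_samples=0.7)  # placeholder: use your search's best value
final_rf.fit(X_train, y_train.values.ravel())
precision_score(y_test, final_rf.predict(X_test))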