XGBoost
Introduction
XGBoost [1] is a popular regularizing gradient boosting framework.
Installation
In most systems, installing XGBoost can be done simply by using pip
$ pip install xgboost

Example
Training XGBoost with the credit-bias dataset.
import pandas as pd
data = pd.read_csv("../data/credit-bias-train.zip")
data.head()

| | NewCreditCustomer | Amount | Interest | LoanDuration | Education | NrOfDependants | EmploymentDurationCurrentEmployer | IncomeFromPrincipalEmployer | IncomeFromPension | IncomeFromFamilyAllowance | … | Mortgage | Other | Owner | Owner_with_encumbrance | Tenant | Entrepreneur | Fully | Partially | Retiree | Self_employed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 2125.0 | 20.97 | 60 | 4.0 | 0.0 | 6.0 | 0.0 | 301.0 | 0.0 | … | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 
| 1 | False | 3000.0 | 17.12 | 60 | 5.0 | 0.0 | 6.0 | 900.0 | 0.0 | 0.0 | … | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 
| 2 | True | 9100.0 | 13.67 | 60 | 4.0 | 1.0 | 3.0 | 600.0 | 0.0 | 0.0 | … | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 
| 3 | True | 635.0 | 42.66 | 60 | 2.0 | 0.0 | 1.0 | 745.0 | 0.0 | 0.0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 
| 4 | False | 5000.0 | 24.52 | 60 | 4.0 | 1.0 | 5.0 | 1000.0 | 0.0 | 0.0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 
5 rows × 40 columns
X_df = data.drop('PaidLoan', axis=1)
y_df = data['PaidLoan']
y_df.describe()

count     58003
unique        2
top        True
freq      29219
Name: PaidLoan, dtype: object

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X_df, y_df, test_size=0.25, random_state=42)

Hyperparameter estimation
Runs a grid search to find the tuning parameters that maximise the area under the curve (AUC). train_x is the training dataframe with loan details and train_y is the default target column for training. The method returns the best parameters and the corresponding AUC score.
The objective parameter [2] specifies the learning task and the corresponding learning objective; a short sketch after the list below shows how it is passed to the estimator. Possible values include:
Objective function
- reg:squarederror, regression with squared loss.
- reg:squaredlogerror, regression with squared log loss
- reg:logistic, logistic regression
- reg:pseudohubererror, regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
- binary:logistic, logistic regression for binary classification, output probability
- binary:logitraw, logistic regression for binary classification, output score before logistic transformation
- binary:hinge, hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
- count:poisson, poisson regression for count data, output mean of Poisson distribution
- survival:cox, Cox regression for right censored survival time data (negative values are considered right censored).
- survival:aft, Accelerated failure time model for censored survival time data. See Survival Analysis with Accelerated Failure Time for details.
- aft_loss_distribution, Probability Density Function used by survival:aft objective and aft-nloglik metric.
- multi:softmax, set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class (number of classes)
- multi:softprob, same as softmax, but output a vector of ndata * nclass, which can be further reshaped to an ndata * nclass matrix. The result contains the predicted probability of each data point belonging to each class.
- rank:pairwise, Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
- rank:ndcg, Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized
- rank:map, Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized
- reg:gamma, gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
- reg:tweedie, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
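As a minimal sketch (not part of the original notebook), the objective can be passed directly to the scikit-learn wrapper used later in this example; 'binary:logistic' matches the binary credit-bias target:

from xgboost.sklearn import XGBClassifier

# Minimal sketch: the learning objective is set through the estimator's constructor.
# binary:logistic outputs probabilities; binary:logitraw would output raw scores instead.
clf = XGBClassifier(objective='binary:logistic', n_estimators=50, max_depth=4)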
Weight balance
scale_pos_weight (default 1) controls the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider is sum(negative instances) / sum(positive instances).
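For the boolean PaidLoan target used here, that ratio can be computed directly from train_y; the sketch below mirrors the calculation done inside the grid-search helper further down:

# Negative-to-positive ratio; train_y is boolean, so train_y.sum() counts the True (positive) labels.
scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
print(scale_pos_weight)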
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
from typing import Tuple
def find_best_xgboost_model(train_x: pd.DataFrame, train_y: pd.Series) -> Tuple[dict, float]:
    scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
    param_test = {
            'max_depth': [1, 2, 4, 8],
            'learning_rate': [0.05, 0.06, 0.07],
            'n_estimators': [10, 50, 100]
        }
    gsearch = GridSearchCV(estimator=XGBClassifier(
        use_label_encoder=False,
        objective='binary:logistic',
        scale_pos_weight=scale_pos_weight,
        tree_method = "hist",
        seed=27),
        param_grid=param_test, scoring='roc_auc', n_jobs=-1, cv=8)
    gsearch.fit(train_x, train_y)
    return gsearch.best_params_, gsearch.best_score_
best_params, best_score = find_best_xgboost_model(train_x, train_y)

Using the best XGBoost model parameters, the next step predicts the probabilities of defaulting.
- best_params_, best tuning parameters
- train_x, training dataframe with loan details
- train_y, default target column for training
- test_x, testing dataframe with loan details
- test_y, default target column for testing
The result is a series of probabilities indicating whether each loan entry will default or not, together with the model’s AUC score.
from sklearn.metrics import roc_auc_score
def xgboost_predict(best_params_: dict, train_x: pd.DataFrame, train_y: pd.Series, test_x: pd.DataFrame,
                    test_y: pd.Series) -> Tuple[list, float]:
    scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
    xgb_model = XGBClassifier(objective='binary:logistic',
                              scale_pos_weight=scale_pos_weight,
                              seed=27,
                              max_depth=best_params_['max_depth'],
                              learning_rate=best_params_['learning_rate'],
                              n_estimators=best_params_['n_estimators']
                              )
    xgb_model.fit(train_x, train_y)
    predicted_probabilities_ = xgb_model.predict_proba(test_x)[:, 1]
    auc_ = roc_auc_score(test_y, predicted_probabilities_)
    return predicted_probabilities_, auc_
predicted_probabilities, auc = xgboost_predict(best_params, train_x, train_y, test_x, test_y)
print("AUC: {}".format(auc))AUC: 0.7356799122465589Filters the original loan dataframe to just include the loans from the test dataframe and then it adds the predicted probabilities.
- loans_df_, original loan dataframe
- test_index, indices from the test dataframes
- predicted_probabilities_, the probabilities forecasted by the XGBoost model
Returns the loans dataframe with predictions
import numpy as np
def prepare_test_with_predictions(loans_df_: pd.DataFrame, test_index: pd.Index,
                                  predicted_probabilities_: np.ndarray) -> pd.DataFrame:
    # Copy the selected test rows so the new column is not written onto a view of the original frame.
    loan_test_df = loans_df_.loc[test_index].copy()
    loan_test_df['predicted_probabilities'] = predicted_probabilities_
    return loan_test_df
loans_with_predictions_df = prepare_test_with_predictions(data, test_x.index, predicted_probabilities)
loans_with_predictions_df.head()

| | NewCreditCustomer | Amount | Interest | LoanDuration | Education | NrOfDependants | EmploymentDurationCurrentEmployer | IncomeFromPrincipalEmployer | IncomeFromPension | IncomeFromFamilyAllowance | … | Other | Owner | Owner_with_encumbrance | Tenant | Entrepreneur | Fully | Partially | Retiree | Self_employed | predicted_probabilities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30299 | False | 530.0 | 10.68 | 36 | 4.0 | NaN | 5.0 | 0.0 | 0.0 | 0.0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.641520 | 
| 34126 | False | 530.0 | 21.57 | 24 | 4.0 | NaN | 1.0 | 0.0 | 0.0 | 0.0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.770486 | 
| 11200 | False | 2300.0 | 15.62 | 36 | 4.0 | 0.0 | 6.0 | 1159.0 | 0.0 | 0.0 | … | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.748680 | 
| 25133 | True | 530.0 | 27.36 | 36 | 4.0 | NaN | 6.0 | 0.0 | 0.0 | 0.0 | … | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.469619 | 
| 42758 | True | 4250.0 | 18.94 | 60 | 4.0 | NaN | 1.0 | 0.0 | 0.0 | 0.0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.527547 | 
5 rows × 41 columns
Visualisation
import seaborn as sns
sns.histplot(loans_with_predictions_df['predicted_probabilities'], stat='density')

<AxesSubplot:xlabel='predicted_probabilities', ylabel='Density'>
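A variant worth sketching (an addition, not in the original notebook) colours the same histogram by the actual PaidLoan outcome, which gives a quick visual check of how well the predicted probabilities separate paid from unpaid loans:

# Sketch: overlay the predicted-probability distributions for paid vs. unpaid loans.
sns.histplot(data=loans_with_predictions_df, x='predicted_probabilities',
             hue='PaidLoan', stat='density', common_norm=False)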
ROC and AUC
Based on the actuals and predicted values [3], it calculates the false positive rate (fpr) and the true positive rate (tpr). It also returns the corresponding thresholds as well as the value for the area under the curve.
- actuals, series of actual values indicating whether the loan defaulted or not
- predicted_probabilities, series of predicted probabilities of the loan defaulting

Returns the series of false and true positive rates with the corresponding series of thresholds and the value for the total area under the curve.
from sklearn.metrics import roc_curve, auc
def get_roc_auc_data(actuals: pd.Series, predicted_probabilities: pd.Series) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float]:
    fpr, tpr, thresholds = roc_curve(actuals, predicted_probabilities, pos_label=1)
    auc_score = auc(fpr, tpr)
    return fpr, tpr, thresholds, auc_score
fpr, tpr, thresholds, auc_score = get_roc_auc_data(loans_with_predictions_df['PaidLoan'], loans_with_predictions_df['predicted_probabilities'])
sns.histplot(fpr)

<AxesSubplot:ylabel='Count'>
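Plotting fpr against tpr draws the ROC curve itself; a minimal matplotlib sketch (an addition, not part of the original analysis) is:

import matplotlib.pyplot as plt

# Sketch: ROC curve with the diagonal of a random classifier for reference.
plt.plot(fpr, tpr, label="XGBoost (AUC = {:.3f})".format(auc_score))
plt.plot([0, 1], [0, 1], linestyle='--', label="Random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()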