XGBoost [1] is a popular regularizing gradient boosting framework.


On most systems, XGBoost can be installed with pip:

$ pip install xgboost


Training XGBoost with the credit-bias dataset.

import pandas as pd

data = pd.read_csv("../data/credit-bias-train.zip")


5 rows × 40 columns

X_df = data.drop('PaidLoan', axis=1)
y_df = data['PaidLoan']
y_df.describe()
count     58003
unique        2
top        True
freq      29219
Name: PaidLoan, dtype: object
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X_df, y_df, test_size=0.25, random_state=42)

Hyperparameter estimation

The function find_best_xgboost_model runs a grid search to find the hyperparameters that maximise the area under the ROC curve (AUC). train_x is the training dataframe with loan details and train_y is the target column (PaidLoan) for training. The function returns the best parameters and the corresponding AUC score.

The objective parameter [2] specifies the learning task and the corresponding learning objective. Possible values include:

Objective function

  • reg:squarederror, regression with squared loss.
  • reg:squaredlogerror, regression with squared log loss
  • reg:logistic, logistic regression
  • reg:pseudohubererror, regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
  • binary:logistic, logistic regression for binary classification, output probability
  • binary:logitraw, logistic regression for binary classification, output score before logistic transformation
  • binary:hinge, hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
  • count:poisson, poisson regression for count data, output mean of Poisson distribution
  • survival:cox, Cox regression for right censored survival time data (negative values are considered right censored).
  • survival:aft, Accelerated failure time model for censored survival time data. See Survival Analysis with Accelerated Failure Time for details.
  • aft_loss_distribution, a separate parameter rather than an objective value: the Probability Density Function used by the survival:aft objective and the aft-nloglik metric.
  • multi:softmax, set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class (number of classes)
  • multi:softprob, same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.
  • rank:pairwise, Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
  • rank:ndcg, Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized
  • rank:map, Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized
  • reg:gamma, gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
  • reg:tweedie, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
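In the list above, binary:logistic and binary:logitraw differ only by the logistic (sigmoid) transformation. The following is a minimal sketch in plain Python, independent of XGBoost; the function name sigmoid and the sample scores are illustrative assumptions, not part of the library.

```python
import math


def sigmoid(raw_score: float) -> float:
    """Map a raw score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))


# Hypothetical raw scores, of the kind an objective such as binary:logitraw outputs.
raw_scores = [-2.0, 0.0, 2.0]

# Applying the logistic transformation yields probabilities,
# which is the kind of output binary:logistic produces.
probabilities = [sigmoid(score) for score in raw_scores]
print(probabilities)
```

A raw score of 0 maps to a probability of 0.5, and larger scores map to larger probabilities, so both outputs rank the data points identically.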

Weight balance

scale_pos_weight (default 1) controls the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider is sum(negative instances) / sum(positive instances).
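The suggested ratio can be sketched in plain Python as follows. The helper name suggested_scale_pos_weight and the toy labels are hypothetical and used only for illustration.

```python
from typing import Sequence


def suggested_scale_pos_weight(labels: Sequence[bool]) -> float:
    """Return count(negative) / count(positive) for binary labels."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("at least one positive label is required")
    return negatives / positives


# Hypothetical toy labels: 2 positives and 6 negatives.
toy_labels = [True, False, False, True, False, False, False, False]
print(suggested_scale_pos_weight(toy_labels))  # 3.0
```

For comparison, the describe output earlier in this section reports 29219 True values out of 58003 rows for the full dataset, which gives 28784 / 29219 ≈ 0.985. The classes are therefore close to balanced before the train/test split.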

from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
from typing import Tuple

def find_best_xgboost_model(train_x: pd.DataFrame, train_y: pd.Series) -> Tuple[dict, float]:
    # Ratio of negative to positive instances, used to balance the class weights
    scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()

    # Candidate values for the grid search
    param_test = {
        'max_depth': [1, 2, 4, 8],
        'learning_rate': [0.05, 0.06, 0.07],
        'n_estimators': [10, 50, 100]
    }

    # 8-fold cross-validated grid search, scored by ROC AUC
    gsearch = GridSearchCV(
        estimator=XGBClassifier(
            scale_pos_weight=scale_pos_weight,
            tree_method="hist"),
        param_grid=param_test, scoring='roc_auc', n_jobs=-1, cv=8)

    gsearch.fit(train_x, train_y)

    return gsearch.best_params_, gsearch.best_score_
best_params, best_score = find_best_xgboost_model(train_x, train_y)
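To give a sense of the cost of this search, the sketch below enumerates the candidate combinations with itertools, using the grid values and the fold count from find_best_xgboost_model. It only counts combinations and does not train any model; the variable names are illustrative.

```python
from itertools import product

# Grid values copied from find_best_xgboost_model.
param_grid = {
    "max_depth": [1, 2, 4, 8],
    "learning_rate": [0.05, 0.06, 0.07],
    "n_estimators": [10, 50, 100],
}
cv_folds = 8  # the cv argument passed to GridSearchCV

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))             # 36 candidates (4 * 3 * 3)
print(len(candidates) * cv_folds)  # 288 cross-validation fits
print(candidates[0])
```

By default GridSearchCV also refits the best candidate once on the full training data, so the total is one fit more than the cross-validation count.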

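Using the best parameters found by the grid search, xgboost_predict trains an XGBoost model and predicts a probability for each loan in the test set (the probability of the positive class of PaidLoan).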

  • best_params_, best tuning parameters
  • train_x, training dataframe with loan details
  • train_y, default target column for training
  • test_x, testing dataframe with loan details
  • test_y, default target column for testing

The result is a series of predicted probabilities, one per loan entry, and the model's corresponding AUC score.

from sklearn.metrics import roc_auc_score

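def xgboost_predict(best_params_: dict, train_x: pd.DataFrame, train_y: pd.Series, test_x: pd.DataFrame,
                    test_y: pd.Series) -> Tuple[list, float]:
    # Ratio of negative to positive instances, used to balance the class weights
    scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
    # Pass the tuned hyperparameters from the grid search as keyword arguments
    xgb_model = XGBClassifier(objective='binary:logistic',
                              scale_pos_weight=scale_pos_weight,
                              **best_params_)

    xgb_model.fit(train_x, train_y)
    # Column 1 of predict_proba holds the probability of the positive class
    predicted_probabilities_ = xgb_model.predict_proba(test_x)[:, 1]
    auc_ = roc_auc_score(test_y, predicted_probabilities_)

    return predicted_probabilities_, auc_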
predicted_probabilities, auc = xgboost_predict(best_params, train_x, train_y, test_x, test_y)
print("AUC: {}".format(auc))
AUC: 0.7356799122465589

Filters the original loan dataframe to include only the loans in the test dataframe, then adds the predicted probabilities.

  • loans_df_, original loan dataframe
  • test_index, indices from the test dataframes
  • predicted_probabilities_, the probabilities forecasted by the XGBoost model

Returns the loans dataframe with predictions

import numpy as np

def prepare_test_with_predictions(loans_df_: pd.DataFrame, test_index: pd.Index,
                                  predicted_probabilities_: np.ndarray) -> pd.DataFrame:
    # Copy the selected rows so the new column is not written to a view of loans_df_
    loan_test_df = loans_df_.loc[test_index].copy()
    loan_test_df['predicted_probabilities'] = predicted_probabilities_
    return loan_test_df
loans_with_predictions_df = prepare_test_with_predictions(data, test_x.index, predicted_probabilities)


5 rows × 41 columns


import seaborn as sns

sns.histplot(loans_with_predictions_df['predicted_probabilities'], stat='density')
<AxesSubplot:xlabel='predicted_probabilities', ylabel='Density'>


Based on the actual and predicted values [3], get_roc_auc_data calculates the false positive rate (fpr) and the true positive rate (tpr). It also returns the corresponding thresholds and the value of the area under the curve.

  • actuals, series of actual values indicating whether the loan defaulted or not
  • predicted_probabilities, series of predicted probabilities of the loan defaulting

Returns the series of false and true positive rates, with the corresponding series of thresholds and the value of the total area under the curve.
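To make the two rates concrete, here is a minimal sketch in plain Python that computes them at a single threshold, treating every score at or above the threshold as a positive prediction. The helper rates_at_threshold and the toy data are hypothetical; the sketch assumes the data contains at least one positive and one negative actual.

```python
from typing import Sequence, Tuple


def rates_at_threshold(actuals: Sequence[bool], scores: Sequence[float],
                       threshold: float) -> Tuple[float, float]:
    """Return (fpr, tpr) when scores >= threshold are predicted positive."""
    true_pos = sum(1 for a, s in zip(actuals, scores) if a and s >= threshold)
    false_pos = sum(1 for a, s in zip(actuals, scores) if not a and s >= threshold)
    positives = sum(1 for a in actuals if a)
    negatives = len(actuals) - positives
    # Assumes positives > 0 and negatives > 0, otherwise a rate is undefined.
    return false_pos / negatives, true_pos / positives


# Hypothetical toy data: three positives and three negatives.
toy_actuals = [True, True, False, True, False, False]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

for threshold in (0.85, 0.5, 0.2):
    print(threshold, rates_at_threshold(toy_actuals, toy_scores, threshold))
```

Lowering the threshold can only increase both rates, which is why the ROC curve runs from (0, 0) towards (1, 1) as the threshold decreases.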

from sklearn.metrics import roc_curve, auc

def get_roc_auc_data(actuals: pd.Series, predicted_probabilities: pd.Series) -> Tuple[np.array, np.array, np.array, float]:
    fpr, tpr, thresholds = roc_curve(actuals, predicted_probabilities, pos_label=1)
    auc_score = auc(fpr, tpr)
    return fpr, tpr, thresholds, auc_score

fpr, tpr, thresholds, auc_score = get_roc_auc_data(loans_with_predictions_df['PaidLoan'], loans_with_predictions_df['predicted_probabilities'])
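sklearn.metrics.auc computes the area from the (fpr, tpr) points with the trapezoidal rule. The sketch below reproduces that calculation in plain Python on made-up ROC points; the helper trapezoid_area and the toy values are illustrative assumptions, not the values produced by the model above.

```python
from typing import Sequence


def trapezoid_area(x: Sequence[float], y: Sequence[float]) -> float:
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    area = 0.0
    for i in range(1, len(x)):
        # Width of the segment times the average of its two heights.
        area += (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    return area


# Hypothetical ROC points, sorted by increasing false positive rate.
toy_fpr = [0.0, 0.0, 0.5, 1.0]
toy_tpr = [0.0, 0.5, 1.0, 1.0]
print(trapezoid_area(toy_fpr, toy_tpr))  # 0.875
```

A curve along the diagonal from (0, 0) to (1, 1) gives an area of 0.5, the score of a random ranking, while a perfect ranking gives 1.0.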