XGBoost
Introduction
XGBoost is a popular regularized gradient boosting framework.
Installation
On most systems, XGBoost can be installed simply by using pip:
$ pip install xgboost
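A quick way to confirm the installation is to import the package and print its version (a minimal check; the version string will vary by environment):
import xgboost as xgb
# Print the installed version to confirm that the package imports correctly
print(xgb.__version__)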
Example
Training XGBoost with the credit-bias dataset.
import pandas as pd
data = pd.read_csv("../data/credit-bias-train.zip")
data.head()
 | NewCreditCustomer | Amount | Interest | LoanDuration | Education | NrOfDependants | EmploymentDurationCurrentEmployer | IncomeFromPrincipalEmployer | IncomeFromPension | IncomeFromFamilyAllowance | ... | Mortgage | Other | Owner | Owner_with_encumbrance | Tenant | Entrepreneur | Fully | Partially | Retiree | Self_employed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | 2125.0 | 20.97 | 60 | 4.0 | 0.0 | 6.0 | 0.0 | 301.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | False | 3000.0 | 17.12 | 60 | 5.0 | 0.0 | 6.0 | 900.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | True | 9100.0 | 13.67 | 60 | 4.0 | 1.0 | 3.0 | 600.0 | 0.0 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | True | 635.0 | 42.66 | 60 | 2.0 | 0.0 | 1.0 | 745.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
4 | False | 5000.0 | 24.52 | 60 | 4.0 | 1.0 | 5.0 | 1000.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 40 columns
X_df = data.drop('PaidLoan', axis=1)
y_df = data['PaidLoan']
y_df.describe()
count 58003
unique 2
top True
freq 29219
Name: PaidLoan, dtype: object
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X_df, y_df, test_size=0.25, random_state=42)
Hyperparameter estimation
Runs a grid search to find the tuning parameters that maximise the area under the curve (AUC). train_x is the training dataframe with loan details and train_y is the default target column for training. The method returns the best parameters and the corresponding AUC score.
The objective parameter specifies the learning task and the corresponding learning objective. Possible values include the following (a short sketch of setting the objective follows the list):
Objective function
reg:squarederror, regression with squared loss.
reg:squaredlogerror, regression with squared log loss.
reg:logistic, logistic regression.
reg:pseudohubererror, regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
binary:logistic, logistic regression for binary classification, output probability.
binary:logitraw, logistic regression for binary classification, output score before logistic transformation.
binary:hinge, hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
count:poisson, Poisson regression for count data, output mean of Poisson distribution.
survival:cox, Cox regression for right censored survival time data (negative values are considered right censored).
survival:aft, Accelerated failure time model for censored survival time data. See Survival Analysis with Accelerated Failure Time for details.
aft_loss_distribution, Probability Density Function used by survival:aft objective and aft-nloglik metric.
multi:softmax, set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes).
multi:softprob, same as softmax, but output a vector of ndata * nclass, which can be further reshaped to an ndata * nclass matrix. The result contains the predicted probability of each data point belonging to each class.
rank:pairwise, use LambdaMART to perform pairwise ranking where the pairwise loss is minimized.
rank:ndcg, use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized.
rank:map, use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized.
reg:gamma, gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
reg:tweedie, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
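As a minimal sketch (not taken from this tutorial), the objective can be set either on the scikit-learn wrapper or in the params dict of the native training API; the small random X and y arrays below are placeholders used only to make the snippet runnable:
import numpy as np
import xgboost as xgb
# Placeholder data purely for illustration: 100 rows, 5 features, binary labels
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
# The objective can be set on the scikit-learn wrapper...
clf = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10)
clf.fit(X, y)
# ...or passed through the params dict of the native training API
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)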
Weight balance
scale_pos_weight (default 1) controls the balance of positive and negative weights, which is useful for unbalanced classes. A typical value to consider is sum(negative instances) / sum(positive instances).
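As a quick check, this ratio can be computed directly from the training labels before fitting; a minimal sketch, assuming train_y is the boolean PaidLoan Series defined above (the same expression is used inside find_best_xgboost_model below):
# Negatives divided by positives; True is treated as the positive class
scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
print(scale_pos_weight)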
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
from typing import Tuple
def find_best_xgboost_model(train_x: pd.DataFrame, train_y: pd.Series) -> Tuple[dict, float]:
    # Ratio of negative to positive examples, used to rebalance the classes
    scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
    param_test = {
        'max_depth': [1, 2, 4, 8],
        'learning_rate': [0.05, 0.06, 0.07],
        'n_estimators': [10, 50, 100]
    }
    gsearch = GridSearchCV(estimator=XGBClassifier(use_label_encoder=False,
                                                   objective='binary:logistic',
                                                   scale_pos_weight=scale_pos_weight,
                                                   tree_method='hist',
                                                   seed=27),
                           param_grid=param_test, scoring='roc_auc', n_jobs=-1, cv=8)
    # Exhaustively evaluate every parameter combination with 8-fold cross-validation
    gsearch.fit(train_x, train_y)
    return gsearch.best_params_, gsearch.best_score_
best_params, best_score = find_best_xgboost_model(train_x, train_y)
Using the best XGBoost model parameters, it predicts the probabilities of defaulting.
best_params_, best tuning parameters
train_x, training dataframe with loan details
train_y, default target column for training
test_x, testing dataframe with loan details
test_y, default target column for testing
The result is a series of predicted probabilities of each loan defaulting, together with the model's AUC score.
from sklearn.metrics import roc_auc_score
def xgboost_predict(best_params_: dict, train_x: pd.DataFrame, train_y: pd.Series, test_x: pd.DataFrame,
                    test_y: pd.Series) -> Tuple[list, float]:
    scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
    xgb_model = XGBClassifier(objective='binary:logistic',
                              scale_pos_weight=scale_pos_weight,
                              seed=27,
                              max_depth=best_params_['max_depth'],
                              learning_rate=best_params_['learning_rate'],
                              n_estimators=best_params_['n_estimators'])
    xgb_model.fit(train_x, train_y)
    # Keep the probability of the positive class (second column of predict_proba)
    predicted_probabilities_ = xgb_model.predict_proba(test_x)[:, 1]
    auc_ = roc_auc_score(test_y, predicted_probabilities_)
    return predicted_probabilities_, auc_
predicted_probabilities, auc = xgboost_predict(best_params, train_x, train_y, test_x, test_y)
print("AUC: {}".format(auc))
AUC: 0.7356799122465589
Filters the original loan dataframe to just the loans from the test dataframe and then adds the predicted probabilities.
loans_df_, original loan dataframe
test_index, indices from the test dataframe
predicted_probabilities_, the probabilities forecast by the XGBoost model
Returns the loans dataframe with predictions.
import numpy as np
def prepare_test_with_predictions(loans_df_: pd.DataFrame, test_index: pd.Index,
                                  predicted_probabilities_: np.array) -> pd.DataFrame:
    # Keep only the rows that were part of the test split
    loan_test_df = loans_df_.loc[test_index]
    loan_test_df['predicted_probabilities'] = predicted_probabilities_
    return loan_test_df
loans_with_predictions_df = prepare_test_with_predictions(data, test_x.index, predicted_probabilities)
loans_with_predictions_df.head()
 | NewCreditCustomer | Amount | Interest | LoanDuration | Education | NrOfDependants | EmploymentDurationCurrentEmployer | IncomeFromPrincipalEmployer | IncomeFromPension | IncomeFromFamilyAllowance | ... | Other | Owner | Owner_with_encumbrance | Tenant | Entrepreneur | Fully | Partially | Retiree | Self_employed | predicted_probabilities |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30299 | False | 530.0 | 10.68 | 36 | 4.0 | NaN | 5.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.641520 |
34126 | False | 530.0 | 21.57 | 24 | 4.0 | NaN | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.770486 |
11200 | False | 2300.0 | 15.62 | 36 | 4.0 | 0.0 | 6.0 | 1159.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.748680 |
25133 | True | 530.0 | 27.36 | 36 | 4.0 | NaN | 6.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.469619 |
42758 | True | 4250.0 | 18.94 | 60 | 4.0 | NaN | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.527547 |
5 rows × 41 columns
Visualisation
import seaborn as sns
sns.histplot(loans_with_predictions_df['predicted_probabilities'], stat='density')
<AxesSubplot:xlabel='predicted_probabilities', ylabel='Density'>
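To see how well the scores separate the two outcomes, one option (not in the original notebook) is to overlay the predicted-probability distributions for paid and unpaid loans, using the PaidLoan column as the hue:
# Overlay the score distributions of the two actual outcomes
sns.histplot(data=loans_with_predictions_df, x='predicted_probabilities',
             hue='PaidLoan', stat='density', common_norm=False)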
ROC and AUC
Based on the actual and predicted values, it calculates the false positive rate (fpr) and the true positive rate (tpr). It also returns the corresponding thresholds used as well as the value for the area under the curve.
actuals, series of actual values indicating whether the loan defaulted or not
predicted_probabilities, series of predicted probabilities of the loan defaulting
Returns series of false and true positive rates with the corresponding series of thresholds and the value for the total area under the curve.
from sklearn.metrics import roc_curve, auc
def get_roc_auc_data(actuals: pd.Series, predicted_probabilities: pd.Series) -> Tuple[np.array, np.array, np.array, float]:
    fpr, tpr, thresholds = roc_curve(actuals, predicted_probabilities, pos_label=1)
    auc_score = auc(fpr, tpr)
    return fpr, tpr, thresholds, auc_score
fpr, tpr, thresholds, auc_score = get_roc_auc_data(loans_with_predictions_df['PaidLoan'], loans_with_predictions_df['predicted_probabilities'])
sns.histplot(fpr)
<AxesSubplot:ylabel='Count'>
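The histogram above only shows the distribution of false positive rates. A more common view is the ROC curve itself, plotting tpr against fpr with the AUC in the legend; a minimal matplotlib sketch (not part of the original notebook) follows:
import matplotlib.pyplot as plt
# Plot the ROC curve and the diagonal of a random classifier for reference
plt.plot(fpr, tpr, label="XGBoost (AUC = {:.3f})".format(auc_score))
plt.plot([0, 1], [0, 1], linestyle='--', label='Random classifier')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()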