# XGBoost

## Introduction

`XGBoost`^{1} is a popular regularizing gradient boosting framework.

## Installation

On most systems, `XGBoost` can be installed with `pip`:

```
$ pip install xgboost
```
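To confirm that the installation worked, one option is to query the installed version from Python. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is hypothetical and not part of XGBoost.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("xgboost")
if version is None:
    print("xgboost is not installed in this environment")
else:
    print("xgboost version:", version)
```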

## Example

The example below trains XGBoost on the credit-bias dataset.

```
import pandas as pd
data = pd.read_csv("../data/credit-bias-train.zip")
data.head()
```

| | NewCreditCustomer | Amount | Interest | LoanDuration | Education | NrOfDependants | EmploymentDurationCurrentEmployer | IncomeFromPrincipalEmployer | IncomeFromPension | IncomeFromFamilyAllowance | ... | Mortgage | Other | Owner | Owner_with_encumbrance | Tenant | Entrepreneur | Fully | Partially | Retiree | Self_employed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 2125.0 | 20.97 | 60 | 4.0 | 0.0 | 6.0 | 0.0 | 301.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | False | 3000.0 | 17.12 | 60 | 5.0 | 0.0 | 6.0 | 900.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | True | 9100.0 | 13.67 | 60 | 4.0 | 1.0 | 3.0 | 600.0 | 0.0 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | True | 635.0 | 42.66 | 60 | 2.0 | 0.0 | 1.0 | 745.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | False | 5000.0 | 24.52 | 60 | 4.0 | 1.0 | 5.0 | 1000.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

5 rows × 40 columns

```
X_df = data.drop('PaidLoan', axis=1)
y_df = data['PaidLoan']
y_df.describe()
```

```
count 58003
unique 2
top True
freq 29219
Name: PaidLoan, dtype: object
```

```
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X_df, y_df, test_size=0.25, random_state=42)
```
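As a rough check of the split sizes: when `test_size` is given as a fraction, scikit-learn rounds the test set size up and assigns the remaining rows to the training set. The sketch below reproduces that arithmetic for the 58003 rows reported by `describe()` above; the helper name `split_sizes` is hypothetical.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_fraction: float) -> Tuple[int, int]:
    """Return (n_train, n_test) for a fractional test size."""
    # The test set size is rounded up; the training set takes the rest.
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(58003, 0.25)
print("train rows:", n_train, "test rows:", n_test)
```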

### Hyperparameter estimation

The function `find_best_xgboost_model`, defined at the end of this section, runs a grid search to find the tuning parameters that maximise the area under the ROC curve (AUC). `train_x` is the training data frame with loan details and `train_y` is the target column for training. The function returns the best parameters and the corresponding AUC score.

#### Objective function

The `objective` parameter^{2} specifies the learning task and the corresponding learning objective. Possible values include:

- `reg:squarederror`, regression with squared loss.
- `reg:squaredlogerror`, regression with squared log loss.
- `reg:logistic`, logistic regression.
- `reg:pseudohubererror`, regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
- `binary:logistic`, logistic regression for binary classification, output probability.
- `binary:logitraw`, logistic regression for binary classification, output score before logistic transformation.
- `binary:hinge`, hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
- `count:poisson`, Poisson regression for count data, output mean of Poisson distribution.
- `survival:cox`, Cox regression for right censored survival time data (negative values are considered right censored).
- `survival:aft`, accelerated failure time model for censored survival time data. See Survival Analysis with Accelerated Failure Time for details.
- `aft_loss_distribution`, probability density function used by the `survival:aft` objective and the `aft-nloglik` metric.
- `multi:softmax`, set XGBoost to do multiclass classification using the softmax objective. You also need to set `num_class` (number of classes).
- `multi:softprob`, same as softmax, but outputs a vector of `ndata * nclass` values, which can be further reshaped to an `ndata * nclass` matrix. The result contains the predicted probability of each data point belonging to each class.
- `rank:pairwise`, use LambdaMART to perform pairwise ranking where the pairwise loss is minimized.
- `rank:ndcg`, use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized.
- `rank:map`, use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized.
- `reg:gamma`, gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, *e.g.*, for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
- `reg:tweedie`, Tweedie regression with log-link. It might be useful, *e.g.*, for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
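Two of the objectives above can be illustrated without training a model. The sketch below uses only the standard library and made-up numbers: `sigmoid` shows the logistic transformation that separates the raw scores of `binary:logitraw` from the probabilities of `binary:logistic`, and `reshape_softprob` shows how a flat `multi:softprob` vector can be turned into one row per data point. Both helper names are hypothetical, and the row-by-row layout of the flat vector is an assumption of this example.

```python
import math
from typing import List


def sigmoid(margin: float) -> float:
    """Logistic transformation mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def reshape_softprob(flat: List[float], n_class: int) -> List[List[float]]:
    """Split a flat vector of ndata * nclass values into ndata rows."""
    if len(flat) % n_class != 0:
        raise ValueError("length of flat vector must be a multiple of n_class")
    return [flat[i:i + n_class] for i in range(0, len(flat), n_class)]


# Illustrative raw scores (as 'binary:logitraw' would output them)
raw_scores = [-2.0, 0.0, 2.0]
probabilities = [sigmoid(score) for score in raw_scores]
print(probabilities)

# Illustrative flat output for 2 data points and 3 classes
flat_probs = [0.7, 0.2, 0.1, 0.1, 0.3, 0.6]
matrix = reshape_softprob(flat_probs, n_class=3)
print(matrix)
```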

#### Weight balance

`scale_pos_weight` (default `1`) controls the balance of *positive* and *negative* weights, which is useful for unbalanced classes. A typical value to consider is `sum(negative instances) / sum(positive instances)`.
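As a worked example of this ratio, the sketch below plugs in the counts from the `describe()` output above (58003 rows, of which 29219 are `True`). These are counts for the whole dataset, whereas the training code computes the ratio on the training split only, so the number is illustrative. The helper name `negative_to_positive_ratio` is hypothetical.

```python
def negative_to_positive_ratio(n_total: int, n_positive: int) -> float:
    """Return sum(negative instances) / sum(positive instances)."""
    if n_positive <= 0 or n_positive > n_total:
        raise ValueError("n_positive must be between 1 and n_total")
    return (n_total - n_positive) / n_positive


ratio = negative_to_positive_ratio(58003, 29219)
print(round(ratio, 3))  # 0.985
```

A value close to 1 indicates that the two classes are nearly balanced, so the reweighting has little effect on this dataset.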

```
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
from typing import Tuple


def find_best_xgboost_model(train_x: pd.DataFrame, train_y: pd.Series) -> Tuple[dict, float]:
    # Ratio of negative to positive instances, used to rebalance the classes
    scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
    # Candidate values explored by the grid search
    param_test = {
        'max_depth': [1, 2, 4, 8],
        'learning_rate': [0.05, 0.06, 0.07],
        'n_estimators': [10, 50, 100]
    }
    gsearch = GridSearchCV(
        estimator=XGBClassifier(
            use_label_encoder=False,
            objective='binary:logistic',
            scale_pos_weight=scale_pos_weight,
            tree_method="hist",
            seed=27),
        param_grid=param_test, scoring='roc_auc', n_jobs=-1, cv=8)
    # Fit every parameter combination with 8-fold cross-validation
    gsearch.fit(train_x, train_y)
    return gsearch.best_params_, gsearch.best_score_


best_params, best_score = find_best_xgboost_model(train_x, train_y)
```
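To get a feel for the cost of this search, the sketch below enumerates the same parameter grid with `itertools.product` and counts the model fits implied by `cv=8`. It is only an illustration of the combinatorics, not of how `GridSearchCV` is implemented.

```python
from itertools import product

# Same grid as in find_best_xgboost_model above
param_test = {
    'max_depth': [1, 2, 4, 8],
    'learning_rate': [0.05, 0.06, 0.07],
    'n_estimators': [10, 50, 100],
}

# Every combination of one value per parameter
candidates = [dict(zip(param_test, values)) for values in product(*param_test.values())]

n_folds = 8
n_cv_fits = len(candidates) * n_folds

print("candidates:", len(candidates))
print("cross-validation fits:", n_cv_fits)
```

With its default `refit=True`, `GridSearchCV` additionally fits the best candidate once more on the full training data.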

The function below trains an XGBoost model with the best parameters found by the grid search and predicts, for each loan in the test set, the probability of the positive class of `PaidLoan`.

- `best_params_`, best tuning parameters
- `train_x`, training dataframe with loan details
- `train_y`, target column for training
- `test_x`, testing dataframe with loan details
- `test_y`, target column for testing

The result is a series of predicted probabilities, one per loan in the test set, together with the model's AUC score.

```
import numpy as np
from sklearn.metrics import roc_auc_score


def xgboost_predict(best_params_: dict, train_x: pd.DataFrame, train_y: pd.Series, test_x: pd.DataFrame,
                    test_y: pd.Series) -> Tuple[np.ndarray, float]:
    # Ratio of negative to positive instances, used to rebalance the classes
    scale_pos_weight = (len(train_y) - train_y.sum()) / train_y.sum()
    xgb_model = XGBClassifier(objective='binary:logistic',
                              scale_pos_weight=scale_pos_weight,
                              seed=27,
                              max_depth=best_params_['max_depth'],
                              learning_rate=best_params_['learning_rate'],
                              n_estimators=best_params_['n_estimators']
                              )
    xgb_model.fit(train_x, train_y)
    # Column 1 of predict_proba holds the probability of the positive class
    predicted_probabilities_ = xgb_model.predict_proba(test_x)[:, 1]
    auc_ = roc_auc_score(test_y, predicted_probabilities_)
    return predicted_probabilities_, auc_


predicted_probabilities, auc = xgboost_predict(best_params, train_x, train_y, test_x, test_y)
print("AUC: {}".format(auc))
```

```
AUC: 0.7356799122465589
```

The function below filters the original loan dataframe to keep only the loans in the test dataframe, and then adds the predicted probabilities as a new column.

- `loans_df_`, original loan dataframe
- `test_index`, indices from the test dataframe
- `predicted_probabilities_`, the probabilities predicted by the XGBoost model

It returns the loans dataframe with predictions.

```
import numpy as np


def prepare_test_with_predictions(loans_df_: pd.DataFrame, test_index: pd.Index,
                                  predicted_probabilities_: np.ndarray) -> pd.DataFrame:
    # Copy the selected rows so that adding a column does not touch the original dataframe
    loan_test_df = loans_df_.loc[test_index].copy()
    loan_test_df['predicted_probabilities'] = predicted_probabilities_
    return loan_test_df


loans_with_predictions_df = prepare_test_with_predictions(data, test_x.index, predicted_probabilities)
loans_with_predictions_df.head()
```

| | NewCreditCustomer | Amount | Interest | LoanDuration | Education | NrOfDependants | EmploymentDurationCurrentEmployer | IncomeFromPrincipalEmployer | IncomeFromPension | IncomeFromFamilyAllowance | ... | Other | Owner | Owner_with_encumbrance | Tenant | Entrepreneur | Fully | Partially | Retiree | Self_employed | predicted_probabilities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30299 | False | 530.0 | 10.68 | 36 | 4.0 | NaN | 5.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.641520 |
| 34126 | False | 530.0 | 21.57 | 24 | 4.0 | NaN | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.770486 |
| 11200 | False | 2300.0 | 15.62 | 36 | 4.0 | 0.0 | 6.0 | 1159.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.748680 |
| 25133 | True | 530.0 | 27.36 | 36 | 4.0 | NaN | 6.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.469619 |
| 42758 | True | 4250.0 | 18.94 | 60 | 4.0 | NaN | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.527547 |

5 rows × 41 columns

## Visualisation

```
import seaborn as sns
sns.histplot(loans_with_predictions_df['predicted_probabilities'], stat='density')
```

```
<AxesSubplot:xlabel='predicted_probabilities', ylabel='Density'>
```

## ROC and AUC

Based on the actual and predicted values^{3}, the function below calculates the false positive rate (fpr) and the true positive rate (tpr). It also returns the corresponding thresholds as well as the value of the area under the curve.

- `actuals`, series of actual values indicating whether the loan defaulted or not
- `predicted_probabilities`, series of probabilities predicted by the model for each loan

It returns the series of false and true positive rates, the corresponding series of thresholds, and the value of the total area under the curve.

```
from sklearn.metrics import roc_curve, auc


def get_roc_auc_data(actuals: pd.Series,
                     predicted_probabilities: pd.Series) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float]:
    # One (fpr, tpr) point per decision threshold
    fpr, tpr, thresholds = roc_curve(actuals, predicted_probabilities, pos_label=1)
    # Area under the ROC curve
    auc_score = auc(fpr, tpr)
    return fpr, tpr, thresholds, auc_score


fpr, tpr, thresholds, auc_score = get_roc_auc_data(loans_with_predictions_df['PaidLoan'],
                                                   loans_with_predictions_df['predicted_probabilities'])
sns.histplot(fpr)
```

```
<AxesSubplot:ylabel='Count'>
```
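To make the quantities returned by `roc_curve` and `auc` more concrete, the sketch below computes ROC points and the area under them by hand for four made-up predictions, using only the standard library. It is a simplified illustration rather than scikit-learn's implementation, and the helper names `roc_points` and `trapezoid_auc` are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(actuals: Sequence[int], scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (fpr, tpr) lists obtained by lowering the threshold one distinct score at a time."""
    n_pos = sum(1 for a in actuals if a)
    n_neg = len(actuals) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Visit observations from the highest to the lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if actuals[i]:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all observations tied at this score are counted
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr: Sequence[float], tpr: Sequence[float]) -> float:
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2 for k in range(1, len(fpr)))


# Made-up labels and scores, for illustration only
example_actuals = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]

example_fpr, example_tpr = roc_points(example_actuals, example_scores)
example_auc = trapezoid_auc(example_fpr, example_tpr)
print(example_fpr)  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(example_tpr)  # [0.0, 0.5, 0.5, 1.0, 1.0]
print(example_auc)  # 0.75
```

The value 0.75 matches the pairwise reading of AUC: of the four (positive, negative) pairs in this example, three have the positive observation scored higher.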