Model performance metrics

Difference in Positive Proportions in Predicted Labels (DPPL)

DPPL (Difference in Positive Proportions in Predicted Labels) is a metric for evaluating machine learning models on imbalanced datasets. It measures the difference between the proportion of positive predictions the model makes for the minority class and the proportion it makes for the majority class.

A low DPPL indicates that the model predicts the positive label at similar rates for minority-class and majority-class instances. This is desirable on imbalanced datasets, where it is important that the minority class is not overlooked.

DPPL is specific to classification: it applies only to models that perform binary or multi-class classification. Because the minority class is often under-represented in imbalanced datasets, the metric checks whether the model assigns positive predictions to it at a rate comparable to the majority class.

DPPL is computed from predicted labels. For a model that outputs scores or probabilities, it therefore depends on the decision threshold used to convert those scores into labels. If the threshold is adjusted for a specific use case, DPPL should be recomputed at the new threshold.
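For a model that outputs scores rather than labels, one way to check how DPPL behaves as the decision threshold moves is to recompute it at several thresholds. The sketch below is illustrative only: the helper `dppl_at_threshold`, the toy `scores` and `labels` arrays, and the chosen thresholds are all assumptions made for this example, not part of any library.

```python
import numpy as np


def dppl_at_threshold(scores, labels, minority_class, majority_class, threshold):
    """Hypothetical helper: DPPL after thresholding scores into 0/1 predictions."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # A score at or above the threshold counts as a positive prediction.
    predictions = scores >= threshold
    # Mean of a boolean array is the proportion of positive predictions.
    p_minority = predictions[labels == minority_class].mean()
    p_majority = predictions[labels == majority_class].mean()
    return abs(p_minority - p_majority)


# Toy data (assumed): class 1 is the minority, class 0 the majority.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.6, 0.7, 0.9, 0.3, 0.8])

# Recompute DPPL at a few candidate thresholds.
dppl_by_threshold = {
    t: dppl_at_threshold(scores, labels, minority_class=1, majority_class=0, threshold=t)
    for t in (0.25, 0.5, 0.75)
}

for t, value in dppl_by_threshold.items():
    print(f"threshold={t}: DPPL={value:.3f}")
```

Recomputing the metric this way at each candidate threshold shows whether a threshold change alters the gap between the two classes.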

The formula for the DPPL metric is as follows:

$$ DPPL = \vert p_{m} - p_M \vert $$

where:

  • $p_m$ is the proportion of positive predictions made by the model for instances of the minority class
  • $p_M$ is the proportion of positive predictions made by the model for instances of the majority class

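DPPL is therefore the absolute difference between the two proportions. Because both are proportions, it ranges from 0 (identical positive-prediction rates for the two classes) to 1 (positive predictions for every instance of one class and none of the other).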
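As a quick worked example with hypothetical counts (chosen only for illustration, not taken from any real dataset), suppose the model predicts the positive label for 12 of 40 minority-class instances and for 270 of 600 majority-class instances. Then:

$$ p_m = \frac{12}{40} = 0.30, \qquad p_M = \frac{270}{600} = 0.45, \qquad DPPL = \vert 0.30 - 0.45 \vert = 0.15 $$

In this example the model predicts the positive label 15 percentage points less often for the minority class than for the majority class.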

```python
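import pandas as pd

# Illustrative toy data (assumed): 'label' holds the true class and
# 'prediction' holds the model's binary predicted label (1 = positive).
data = pd.DataFrame({
    "label":      [0, 0, 0, 0, 0, 0, 1, 1],
    "prediction": [1, 0, 1, 0, 0, 0, 1, 0],
})

# Identify the minority and majority classes from the label frequencies.
class_counts = data["label"].value_counts()
minority_class = class_counts.idxmin()
majority_class = class_counts.idxmax()

# Proportion of positive predictions for the minority class.
p_minority = data.loc[data["label"] == minority_class, "prediction"].mean()

# Proportion of positive predictions for the majority class.
p_majority = data.loc[data["label"] == majority_class, "prediction"].mean()

# DPPL is the absolute difference between the two proportions.
dppl = abs(p_minority - p_majority)

print("DPPL:", dppl)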