Spearman correlation
Spearman rank correlation
The Spearman correlation coefficient (or Spearman’s $\rho$) measures the rank correlation between two variables. It is used to detect whether a monotonic relationship exists between them.
Spearman’s $\rho$ takes values between $-1$ and $1$. The extremes $-1$ and $1$ correspond to completely opposite and identical ranks, respectively[^1], in other words to a perfectly decreasing or a perfectly increasing monotonic relationship.
Because it depends only on ranks, Spearman’s $\rho$ is typically used for ordinal values, although discrete and continuous values are also possible.
If we consider a dataset of size $n$ with scores $X_i, Y_i$, we first calculate the ranks $\operatorname{R}({X_i}), \operatorname{R}({Y_i})$, and then Spearman’s $\rho$, written $r_s$, as
$$ r_s = \rho_{\operatorname{R}(X),\operatorname{R}(Y)} = \frac{\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))} {\sigma_{\operatorname{R}(X)} \sigma_{\operatorname{R}(Y)}}, $$
where $\rho$ denotes the Pearson correlation coefficient applied to the rank variables, $\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))$ is the covariance of the rank variables, and $\sigma_{\operatorname{R}(X)}$ and $\sigma_{\operatorname{R}(Y)}$ are the standard deviations of the rank variables.
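The definition can be checked by hand with a few lines of plain Python. This is only an illustrative sketch: the helper names `average_ranks` and `spearman` are hypothetical and not taken from any library, and tied values are assumed to share the mean of the positions they occupy (the same convention as the default `method='average'` of `scipy.stats.rankdata`). The data are the last five iris rows (sepal length and sepal width) listed in the example table further down.

```python
from math import sqrt

def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank variables R(X) and R(Y)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in rx) / n)
    sy = sqrt(sum((b - my) ** 2 for b in ry) / n)
    return cov / (sx * sy)

sepal_len = [6.7, 6.3, 6.5, 6.2, 5.9]
sepal_wid = [3.0, 2.5, 3.0, 3.4, 3.0]  # contains a three-way tie

r_s = spearman(sepal_len, sepal_wid)
print(average_ranks(sepal_wid))  # [3.0, 1.0, 3.0, 5.0, 3.0]
print(r_s)
```

With only five observations the resulting value says little about the full dataset; the point is merely that ranking followed by Pearson's formula reproduces the definition.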
If all the ranks are distinct integers, the simplified form can be applied:
$$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, $$
where $d_i = \operatorname{R}(X_i) - \operatorname{R}(Y_i)$ is the difference between the two ranks of each observation, and $n$ is the number of observations.
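As a small worked example of the simplified form, the sketch below applies it to the first five iris rows (sepal length and sepal width) from the example table further down, which contain no ties. The function names `distinct_ranks` and `spearman_simplified` are hypothetical helpers written for this illustration.

```python
def distinct_ranks(values):
    """Rank values from 1 to n; the simplified form requires no ties."""
    if len(set(values)) != len(values):
        raise ValueError("simplified formula requires distinct values")
    position = {v: r for r, v in enumerate(sorted(values), start=1)}
    return [position[v] for v in values]

def spearman_simplified(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    rx, ry = distinct_ranks(x), distinct_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

sepal_len = [5.1, 4.9, 4.7, 4.6, 5.0]  # ranks: 5, 3, 2, 1, 4
sepal_wid = [3.5, 3.0, 3.2, 3.1, 3.6]  # ranks: 4, 1, 3, 2, 5

r_s = spearman_simplified(sepal_len, sepal_wid)
print(r_s)
```

Here the rank differences are $1, 2, -1, -1, -1$, so $\sum d_i^2 = 8$ and $r_s = 1 - 48/120 = 0.6$.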
The Spearman rank correlation test can be used when:
- The variables are quantitative or ordinal
- The variables do not meet the normality assumption
- The variables have a monotonic relationship
Example
For the example we will use the scipy implementation and the iris dataset.

    from sklearn import datasets
    import pandas as pd
    import scipy

    # create ordinal categories from the petal length
    def categorize_petal_len(x):
        if x <= 1.6:
            return 'LOW'
        elif x <= 5.1:
            return 'AVERAGE'
        else:
            return 'HIGH'

    # load the iris measurements into a DataFrame
    iris = datasets.load_iris()
    df = pd.DataFrame(iris.data)
    df["class"] = iris.target
    df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
    df.dropna(how="all", inplace=True)
    df['petal_len_cats'] = df['petal_len'].apply(categorize_petal_len)
    df

| | sepal_len | sepal_wid | petal_len | petal_wid | class | petal_len_cats |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | LOW |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | LOW |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | LOW |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | LOW |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | LOW |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | HIGH |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | AVERAGE |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | HIGH |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | HIGH |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | AVERAGE |

150 rows × 6 columns

    from scipy.stats import spearmanr

    # Spearman correlation (and p-value) between two columns of df
    def sp(X, Y):
        return spearmanr(df[X], df[Y])

    correlation, p_value = sp('sepal_len', 'sepal_wid')
    print(f"correlation = {correlation}, p-value = {p_value}")

correlation = -0.166777658283235, p-value = 0.04136799424884587
[^1]: Assuming no repeated ranks.