Spearman correlation
Spearman rank correlation
The Spearman correlation coefficient (or Spearman’s \(\rho\)) measures the rank correlation between two variables. It is used to detect whether a monotonic relationship exists between them.
Spearman’s \(\rho\) takes values between \(-1\) and \(1\). The extremes correspond to completely opposite ranks (\(-1\)) or identical ranks (\(1\)),¹ in other words, to a perfectly decreasing or a perfectly increasing monotonic relationship.
Because it depends only on ranks, Spearman’s \(\rho\) is typically used for ordinal values, although it can also be applied to discrete and continuous values.
Consider a dataset of size \(n\) with scores \(X_i, Y_i\). We first compute the ranks \(\operatorname{R}({X_i}), \operatorname{R}({Y_i})\), and then Spearman’s coefficient \(r_s\) as
\[ r_s = \rho_{\operatorname{R}(X),\operatorname{R}(Y)} = \frac{\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))} {\sigma_{\operatorname{R}(X)} \sigma_{\operatorname{R}(Y)}}. \]
Here \(\rho\) denotes the Pearson correlation coefficient applied to the rank variables, \(\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))\) is the covariance of the rank variables, and \(\sigma_{\operatorname{R}(X)}\) and \(\sigma_{\operatorname{R}(Y)}\) are their standard deviations.
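The definition can be sketched directly in plain Python: rank both variables, then compute the Pearson correlation of the ranks. The helper names `average_ranks`, `pearson` and `spearman`, the toy data, and the convention of giving tied values the mean of their positions are assumptions made for illustration, not part of the original example.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(a, b):
    """Covariance over the product of standard deviations (the 1/n factors cancel)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(x, y):
    """Spearman's r_s as the Pearson correlation of the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data for illustration; x contains one tie (1.2 appears twice).
x = [3.1, 1.2, 4.8, 1.2, 5.0]
y = [2.0, 0.5, 3.5, 1.0, 9.0]
print(average_ranks(x))  # [3.0, 1.5, 4.0, 1.5, 5.0]
print(spearman(x, y))    # approximately 0.9747
```

Because this form works on the ranks themselves, it remains valid when ties are present.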
If all \(n\) ranks are distinct integers (that is, there are no ties), the simplified form can be applied:
\[ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, \]
where \(d_i = \operatorname{R}(X_i) - \operatorname{R}(Y_i)\) is the difference between the two ranks of each observation and \(n\) is the number of observations.
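As a small worked example of the simplified form, the sketch below ranks two tie-free lists and plugs the squared rank differences into the formula. The function name `spearman_no_ties` and the data are made up for illustration.

```python
def spearman_no_ties(x, y):
    """Simplified Spearman formula; only valid when all values are distinct."""
    n = len(x)
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("the simplified formula requires distinct values (no ties)")
    # With distinct values, the rank is the 1-based position in sorted order.
    rank_x = {v: r for r, v in enumerate(sorted(x), start=1)}
    rank_y = {v: r for r, v in enumerate(sorted(y), start=1)}
    # Sum of squared rank differences d_i^2 over all observations.
    d_squared = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Illustrative data: ranks of x are 1..5, ranks of y are 1, 3, 2, 5, 4.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(spearman_no_ties(x, y))  # 0.8
```

Here the rank differences are \(0, -1, 1, -1, 1\), so \(\sum d_i^2 = 4\) and \(r_s = 1 - 24 / 120 = 0.8\).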
The Spearman rank correlation test can be used when:
- The variables are quantitative or ordinal
- The variables do not meet the normality assumption
- The variables have a monotonic relationship
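To illustrate the monotonicity condition, the following sketch compares Pearson’s and Spearman’s coefficients on a strictly increasing but strongly nonlinear relationship. The data (\(x = 1, \dots, 10\) and \(y = e^x\)) are made up for illustration; `pearsonr` and `spearmanr` are the SciPy functions.

```python
import math

from scipy.stats import pearsonr, spearmanr

# Illustrative data: y grows exponentially with x, so the relationship
# is perfectly monotonic but far from linear.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r, _ = pearsonr(x, y)     # linear association
rho, _ = spearmanr(x, y)  # monotonic association (computed on ranks)

print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Since the ranks of `x` and `y` are identical, Spearman’s coefficient equals 1, while Pearson’s coefficient is noticeably lower because the relationship is not linear.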
Example
For the example we will use the scipy implementation and the iris dataset.
from sklearn import datasets
import pandas as pd
import scipy

# create ordinal categories
def categorize_petal_len(x):
    if x <= 1.6:
        return 'LOW'
    elif x <= 5.1:
        return 'AVERAGE'
    else:
        return 'HIGH'

iris = datasets.load_iris()
# Coconut-style pipeline; the plain-Python equivalent is pd.DataFrame(iris.data)
df = iris |> .data |> pd.DataFrame
df["class"] = iris.target
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True)
df['petal_len_cats'] = df['petal_len'].apply(categorize_petal_len)
df
| | sepal_len | sepal_wid | petal_len | petal_wid | class | petal_len_cats |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | LOW |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | LOW |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | LOW |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | LOW |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | LOW |
| … | … | … | … | … | … | … |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | HIGH |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | AVERAGE |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | HIGH |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | HIGH |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | AVERAGE |
150 rows × 6 columns
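from scipy.stats import spearmanr

# Coconut-style lambda; in plain Python: sp = lambda X, Y: spearmanr(df[X], df[Y])
sp = (X, Y) -> spearmanr(df[X], df[Y])
correlation, p_value = sp('sepal_len', 'sepal_wid')
# Coconut-style pipe; in plain Python: print(f"...")
f"correlation = {correlation}, p-value = {p_value}" |> print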
correlation = -0.166777658283235, p-value = 0.04136799424884587
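The ordinal categories created earlier can also be used with Spearman’s \(\rho\) once they are mapped to integer codes that respect their order. The sketch below is a minimal illustration using only the ten rows shown in the table above (not the full dataset); the `order` mapping and the variable names are assumptions.

```python
from scipy.stats import spearmanr

# Integer codes that preserve the ordering LOW < AVERAGE < HIGH.
order = {'LOW': 0, 'AVERAGE': 1, 'HIGH': 2}

# The ten rows displayed in the table above (indices 0-4 and 145-149).
petal_len_cats = ['LOW', 'LOW', 'LOW', 'LOW', 'LOW',
                  'HIGH', 'AVERAGE', 'HIGH', 'HIGH', 'AVERAGE']
petal_wid = [0.2, 0.2, 0.2, 0.2, 0.2, 2.3, 1.9, 2.0, 2.3, 1.8]

codes = [order[c] for c in petal_len_cats]

# spearmanr assigns average ranks to ties, so repeated categories are handled.
rho, p_value = spearmanr(codes, petal_wid)
print(f"correlation = {rho}, p-value = {p_value}")
```

With the full DataFrame from the example, the same mapping could be applied with `df['petal_len_cats'].map(order)` before calling `spearmanr`. Only the order of the codes matters, not their spacing, because the coefficient is computed on ranks.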
Footnotes
¹ Assuming no repeated ranks.