Spearman correlation

Spearman rank correlation

The Spearman correlation coefficient (or Spearman’s \(\rho\)) measures the rank correlation between two variables. It is used to detect the existence of a monotonic relationship between the variables.

Spearman’s \(\rho\) takes values between \(-1\) and \(1\). The extremes \(-1\) and \(1\) correspond to completely opposite or identical ranks, respectively¹, or, in other words, to a perfectly negative or a perfectly positive monotonic relationship.

Because it depends only on ranks, Spearman’s \(\rho\) is well suited to ordinal values, although it can also be applied to discrete and continuous values.
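To make the role of ranks concrete, here is a minimal sketch of how ordinal labels could be turned into ranks. The label order LOW < AVERAGE < HIGH, the `ORDER` mapping and the helper name `average_ranks` are assumptions made for illustration. Tied observations share the mean of the positions they occupy, which is one common convention for ties.

```python
# Assumed ordering of the ordinal labels (illustrative only).
ORDER = {"LOW": 0, "AVERAGE": 1, "HIGH": 2}


def average_ranks(values):
    """Return 1-based ranks; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


labels = ["LOW", "HIGH", "AVERAGE", "LOW", "HIGH"]
codes = [ORDER[label] for label in labels]
print(average_ranks(codes))  # [1.5, 4.5, 3.0, 1.5, 4.5]
```

The two LOW observations occupy positions 1 and 2 and therefore both receive rank 1.5; the same logic gives the two HIGH observations rank 4.5.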

If we consider a dataset of size \(n\) with scores \(X_i, Y_i\), we can calculate the ranks \(\operatorname{R}(X_i), \operatorname{R}(Y_i)\) and Spearman’s \(\rho\), denoted \(r_s\), as

\[ r_s = \rho_{\operatorname{R}(X),\operatorname{R}(Y)} = \frac{\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))} {\sigma_{\operatorname{R}(X)} \sigma_{\operatorname{R}(Y)}}, \]

where \(\rho\) is the Pearson correlation coefficient applied to the rank variables, \(\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))\) is the covariance of the rank variables, and \(\sigma_{\operatorname{R}(X)}\) and \(\sigma_{\operatorname{R}(Y)}\) are the standard deviations of the rank variables.
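The definition can be sketched in a few lines of plain Python. This is an illustrative sketch, not the scipy implementation: the function names `pearson` and `spearman_from_ranks` are hypothetical, and the ranks are assumed to have been computed already.

```python
import math


def pearson(a, b):
    """Pearson correlation: cov(a, b) / (sigma_a * sigma_b).

    Undefined (division by zero) if either input is constant.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (sd_a * sd_b)


def spearman_from_ranks(rank_x, rank_y):
    """Spearman's r_s as the Pearson correlation of the rank variables."""
    return pearson(rank_x, rank_y)


# Precomputed ranks of five observations (illustrative values).
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
print(round(spearman_from_ranks(rank_x, rank_y), 6))  # 0.8
```

Because this form only relies on covariance and standard deviations, it also applies when ties produce fractional ranks.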

If all the ranks are distinct integers, the simplified form can be applied:

\[ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, \]

where \(d_i = \operatorname{R}(X_i) - \operatorname{R}(Y_i)\) is the difference between the two ranks of each observation, and \(n\) is the number of observations.
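As a small worked example of the simplified form, the sketch below computes \(r_s\) from rank differences. The function name `spearman_simplified` and the rank values are made up for illustration, and the function assumes the ranks are distinct integers (no ties).

```python
def spearman_simplified(rank_x, rank_y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); valid only without ties."""
    n = len(rank_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
# d = [-1, 1, -1, 1, 0], so the sum of squared differences is 4
# and r_s = 1 - (6 * 4) / (5 * 24) = 1 - 24/120.
print(spearman_simplified(rank_x, rank_y))  # 0.8
```

Fully reversed ranks, such as `[1, 2, 3]` against `[3, 2, 1]`, give \(-1\), and identical ranks give \(1\).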

The Spearman rank correlation test can be used when:

The variables are quantitative or ordinal
The variables do not meet the normality assumption
The variables have a monotonic relationship

Example

For the example we will use the scipy implementation and the iris dataset.

from sklearn import datasets
import pandas as pd
import scipy

# create ordinal categories
def categorize_petal_len(x):
    if x <= 1.6:
        return 'LOW'
    elif x <= 5.1:
        return 'AVERAGE'
    else:
        return 'HIGH'


iris = datasets.load_iris()
df = pd.DataFrame(iris.data)
df["class"] = iris.target
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True)
df['petal_len_cats'] = df['petal_len'].apply(categorize_petal_len)
df

	sepal_len	sepal_wid	petal_len	petal_wid	class	petal_len_cats
0	5.1	3.5	1.4	0.2	0	LOW
1	4.9	3.0	1.4	0.2	0	LOW
2	4.7	3.2	1.3	0.2	0	LOW
3	4.6	3.1	1.5	0.2	0	LOW
4	5.0	3.6	1.4	0.2	0	LOW
…	…	…	…	…	…	…
145	6.7	3.0	5.2	2.3	2	HIGH
146	6.3	2.5	5.0	1.9	2	AVERAGE
147	6.5	3.0	5.2	2.0	2	HIGH
148	6.2	3.4	5.4	2.3	2	HIGH
149	5.9	3.0	5.1	1.8	2	AVERAGE

150 rows × 6 columns

from scipy.stats import spearmanr

# helper: Spearman correlation between two columns of df
def sp(x, y):
    return spearmanr(df[x], df[y])

correlation, p_value = sp('sepal_len', 'sepal_wid')

print(f"correlation = {correlation}, p-value = {p_value}")

correlation = -0.166777658283235, p-value = 0.04136799424884587

Footnotes

¹ Assuming no repeated ranks.