Spearman correlation

Spearman rank correlation

The Spearman correlation coefficient (or Spearman’s \(\rho\)) measures the rank correlation between two variables. It is used to detect a monotonic relationship between them.

Spearman’s \(\rho\) takes values between \(-1\) and \(1\). The extremes correspond to completely opposite or identical ranks, respectively¹, or, in other words, to a perfectly decreasing or a perfectly increasing monotonic relationship.
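To make the link between monotonicity and the extreme values concrete, here is a minimal sketch on made-up data (the arrays `x`, `y_increasing` and `y_decreasing` are illustrative and not part of the example further down). It compares Spearman’s \(\rho\) with the Pearson coefficient on a relationship that is monotonic but far from linear.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative data: no repeated values, so no repeated ranks.
x = np.arange(1, 11, dtype=float)
y_increasing = np.exp(x)   # strictly increasing, strongly nonlinear
y_decreasing = -x ** 3     # strictly decreasing

rho_inc, _ = spearmanr(x, y_increasing)
rho_dec, _ = spearmanr(x, y_decreasing)
r_inc, _ = pearsonr(x, y_increasing)

print(f"Spearman, increasing: {rho_inc:.3f}")
print(f"Spearman, decreasing: {rho_dec:.3f}")
print(f"Pearson,  increasing: {r_inc:.3f}")
```

Since only the ordering of the values matters, the Spearman coefficient reaches its extremes here, while the Pearson coefficient stays below \(1\) because the relationship is not linear.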

Because it depends only on ranks, Spearman’s \(\rho\) is typically used for ordinal variables, although it can also be applied to discrete and continuous ones.
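As a rough illustration of the ordinal case, the sketch below encodes text labels as integers that respect their order before computing the coefficient. The labels, the mapping `order` and the values in `weights` are hypothetical and chosen only for this illustration.

```python
from scipy.stats import spearmanr

# Hypothetical ordinal scale: only the order of the codes matters, not their spacing.
order = {"LOW": 0, "AVERAGE": 1, "HIGH": 2}

sizes = ["LOW", "LOW", "AVERAGE", "HIGH", "AVERAGE", "HIGH"]  # made-up labels
weights = [1.2, 1.0, 2.5, 4.1, 3.0, 3.9]                      # made-up measurements

codes = [order[label] for label in sizes]

# Repeated labels produce tied ranks; spearmanr assigns them average ranks.
rho, p_value = spearmanr(codes, weights)
print(f"rho = {rho:.4f}, p-value = {p_value:.4f}")
```

Any other order-preserving encoding (for example 1, 5, 10) would give the same coefficient, because the ranks would be unchanged.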

If we consider a dataset of size \(n\) with scores \(X_i, Y_i\), we can calculate the ranks \(\operatorname{R}({X_i}), \operatorname{R}({Y_i})\) and the Spearman coefficient \(r_s\) as

\[ r_s = \rho_{\operatorname{R}(X),\operatorname{R}(Y)} = \frac{\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))} {\sigma_{\operatorname{R}(X)} \sigma_{\operatorname{R}(Y)}}, \]

where \(\rho\) is the Pearson correlation coefficient applied to the rank variables, \(\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))\) is the covariance of the rank variables, and \(\sigma_{\operatorname{R}(X)}\) and \(\sigma_{\operatorname{R}(Y)}\) are the standard deviations of the rank variables.
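The definition can be checked numerically. The following sketch, on made-up data, ranks both variables with `scipy.stats.rankdata`, applies the covariance formula to the ranks, and compares the result with `scipy.stats.spearmanr`. The arrays `x` and `y` are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Made-up scores; both variables contain one tie.
x = np.array([10.0, 20.0, 20.0, 35.0, 50.0, 80.0])
y = np.array([3.1, 2.7, 4.0, 4.0, 6.5, 5.9])

# rankdata assigns average ranks to tied values by default.
rank_x = rankdata(x)
rank_y = rankdata(y)

# Pearson correlation of the ranks: covariance over the product of standard deviations.
# np.cov uses ddof=1 by default, so the standard deviations use ddof=1 as well.
cov_ranks = np.cov(rank_x, rank_y)[0, 1]
r_s = cov_ranks / (np.std(rank_x, ddof=1) * np.std(rank_y, ddof=1))

r_s_scipy, _ = spearmanr(x, y)
print(f"manual = {r_s:.6f}, scipy = {r_s_scipy:.6f}")
```

The two values should agree up to floating-point error, including when ties are present.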

If all the ranks are distinct integers, the simplified form can be applied:

\[ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, \]

where \(d_i = \operatorname{R}(X_i) - \operatorname{R}(Y_i)\) is the difference between the two ranks of each observation and \(n\) is the number of observations.
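As a small worked check of the simplified form, the sketch below uses made-up data without ties (so all ranks are distinct integers) and compares the result with `scipy.stats.spearmanr`. The arrays `x` and `y` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Made-up data with no repeated values in either variable.
x = np.array([86, 97, 99, 100, 101, 103, 106, 110, 112, 113])
y = np.array([2, 20, 28, 27, 50, 29, 7, 17, 6, 12])

n = len(x)
d = rankdata(x) - rankdata(y)          # rank differences d_i
sum_d2 = np.sum(d ** 2)

# Simplified formula, valid only when there are no ties.
r_s_formula = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

r_s_scipy, _ = spearmanr(x, y)
print(f"sum of d_i^2 = {sum_d2:.0f}")
print(f"formula = {r_s_formula:.6f}, scipy = {r_s_scipy:.6f}")
```

With tied values the two results would generally differ, and the covariance-based definition should be used instead.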

The Spearman rank correlation test can be used when

  • The variables are quantitative or ordinal
  • The variables do not meet the normality assumption
  • The variables have a monotonic relationship

Example

For the example we will use the scipy implementation and the iris dataset.

from sklearn import datasets
import pandas as pd
import scipy

# create ordinal categories
def categorize_petal_len(x):
    if x <= 1.6:
        return 'LOW'
    elif x <= 5.1:
        return 'AVERAGE'
    else:
        return 'HIGH'


# load the iris measurements into a DataFrame
iris = datasets.load_iris()
df = pd.DataFrame(iris.data)
df["class"] = iris.target
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True)
df['petal_len_cats'] = df['petal_len'].apply(categorize_petal_len)
df
     sepal_len  sepal_wid  petal_len  petal_wid  class  petal_len_cats
0          5.1        3.5        1.4        0.2      0  LOW
1          4.9        3.0        1.4        0.2      0  LOW
2          4.7        3.2        1.3        0.2      0  LOW
3          4.6        3.1        1.5        0.2      0  LOW
4          5.0        3.6        1.4        0.2      0  LOW
...        ...        ...        ...        ...    ...  ...
145        6.7        3.0        5.2        2.3      2  HIGH
146        6.3        2.5        5.0        1.9      2  AVERAGE
147        6.5        3.0        5.2        2.0      2  HIGH
148        6.2        3.4        5.4        2.3      2  HIGH
149        5.9        3.0        5.1        1.8      2  AVERAGE

150 rows × 6 columns

from scipy.stats import spearmanr

# helper: Spearman correlation between two columns of df
def sp(x, y):
    return spearmanr(df[x], df[y])

correlation, p_value = sp('sepal_len', 'sepal_wid')

print(f"correlation = {correlation}, p-value = {p_value}")
correlation = -0.166777658283235, p-value = 0.04136799424884587

Footnotes

  1. Assuming no repeated ranks.