Spearman correlation

Spearman rank correlation

The Spearman correlation coefficient (or Spearman’s $\rho$) measures the rank correlation between two variables. It is used to detect whether a monotonic relationship exists between them.

Spearman’s $\rho$ takes values between $-1$ and $1$. The two extremes correspond to completely opposite or identical ranks, respectively [1], or, in other words, to a perfectly decreasing or a perfectly increasing monotonic relationship.

Because it depends only on ranks, Spearman’s $\rho$ is typically used for ordinal values, although it can also be applied to discrete and continuous values.

If we consider a dataset of size $n$, with $X_i, Y_i$ as the raw scores, we can convert the scores to ranks $\operatorname{R}({X_i}), \operatorname{R}({Y_i})$ and calculate Spearman’s coefficient $r_s$ as

$$ r_s = \rho_{\operatorname{R}(X),\operatorname{R}(Y)} = \frac{\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))} {\sigma_{\operatorname{R}(X)} \sigma_{\operatorname{R}(Y)}}, $$

Here $\rho$ is the Pearson correlation coefficient, but applied to the rank variables, $\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))$ is the covariance of the rank variables, and $\sigma_{\operatorname{R}(X)}$ and $\sigma_{\operatorname{R}(Y)}$ are the standard deviations of the rank variables.
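The definition above can be sketched in plain Python to make the two steps explicit: rank each variable, then take the Pearson correlation of the ranks. This is only an illustrative sketch, not how SciPy implements it. The helper names `average_ranks`, `pearson` and `spearman` and the sample scores are assumptions made for this example, and tied values are assumed to share the average of their positions.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation cov(a, b) / (sigma_a * sigma_b); the 1/n factors cancel."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores, chosen only for illustration
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 20]   # increases with x, but not linearly
print(spearman(x, y))   # 1.0, since both variables have identical ranks
```

Because only the ranks enter the calculation, any strictly increasing transformation of `x` or `y` leaves the result unchanged.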

If all the ranks are distinct integers, that is, there are no ties, the simplified form can be applied:

$$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, $$

where $d_i = \operatorname{R}(X_i) - \operatorname{R}(Y_i)$ is the difference between the two ranks of each observation and $n$ is the number of observations.
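As a small worked example of the simplified form, the sketch below evaluates it on five observations. The function name `spearman_distinct` and the ranks are hypothetical, chosen only to keep the arithmetic easy to follow; the function rejects tied ranks, since the simplified form assumes distinct ranks.

```python
def spearman_distinct(rank_x, rank_y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); valid only for distinct ranks."""
    n = len(rank_x)
    if len(set(rank_x)) != n or len(set(rank_y)) != n:
        raise ValueError("the simplified formula requires distinct ranks")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks of five observations
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]

# d = [-1, 1, -1, 1, 0], so sum(d^2) = 4 and r_s = 1 - 24 / 120 = 0.8
print(spearman_distinct(rank_x, rank_y))
```

Identical ranks give $\sum d_i^2 = 0$ and therefore $r_s = 1$, while fully reversed ranks give $r_s = -1$.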

The Spearman rank correlation test can be used when

  • The variables are quantitative or ordinal
  • The variables do not meet the normality assumption
  • The variables have a monotonic relationship

Example

For the example we will use the scipy implementation and the iris dataset.

from sklearn import datasets
import pandas as pd
import scipy

# create ordinal categories
def categorize_petal_len(x):
    if x <= 1.6:
        return 'LOW'
    elif x <= 5.1:
        return 'AVERAGE'
    else:
        return 'HIGH'


iris = datasets.load_iris() 
df = pd.DataFrame(iris.data)  # feature matrix as a DataFrame
df["class"] = iris.target
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True)
df['petal_len_cats'] = df['petal_len'].apply(categorize_petal_len)
df

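| | sepal_len | sepal_wid | petal_len | petal_wid | class | petal_len_cats |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | LOW |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | LOW |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | LOW |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | LOW |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | LOW |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | HIGH |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | AVERAGE |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | HIGH |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | HIGH |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | AVERAGE |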

150 rows × 6 columns

from scipy.stats import spearmanr

# helper: Spearman correlation and p-value between two columns of df
def sp(X, Y):
    return spearmanr(df[X], df[Y])

correlation, p_value = sp('sepal_len', 'sepal_wid')

print(f"correlation = {correlation}, p-value = {p_value}")
correlation = -0.166777658283235, p-value = 0.04136799424884587

  1. Assuming no repeated ranks.