For each feature \(k=1,\dots,p\) a score \(s_{ijk}\) is calculated. A quantity \(\delta_{ijk} \in \{0,1\}\) is also calculated: it is \(1\) when \(x_i\) and \(x_j\) can be compared on feature \(k\) and \(0\) when they cannot (/e.g./ if they have different types).
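For reference, Gower (1971) combines these two quantities into a single coefficient by averaging the scores over the features that can actually be compared. Written with the symbols above (the name \(S_{\text{Gower}}\) is borrowed from the notation used later in this text), with \(\delta_{ijk}=1\) for comparable features:

\[
S_{\text{Gower}}(x_i, x_j) = \frac{\sum_{k=1}^{p} \delta_{ijk}\, s_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}.
\]

When every \(\delta_{ijk}=1\) the denominator is simply \(p\).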
A special case (Gower 1971), for when no missing values exist, can be formulated as the mean of the Gower similarity scores, that is:
\[
S_{\text{Gower}}(x_i, x_j) = \frac{1}{p}\sum_{k=1}^{p} s_{ijk}.
\]
The calculation of the score \(s_{ijk}\) depends on the type of variable; some examples follow below.
This similarity score takes values in \(0 \leq s_{ijk} \leq 1\), with \(1\) representing maximum similarity and \(0\) no similarity.
Scoring
Numerical variables
For numerical variables the score can be calculated as
\[
s_{ijk} = 1 - \frac{|x_{ik}-x_{jk}|}{R_k}.
\]
This is simply one minus the L1 distance between the two values, normalised by a quantity \(R_k\). The quantity \(R_k\) is the range of feature \(k\), taken over either the population /or/ the sample.
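To make the role of \(R_k\) concrete, here is a minimal sketch of the numerical score. The helper name `numeric_score`, the sample of ages and the population bounds of 0 to 100 are all assumptions made for illustration.

```python
def numeric_score(a, b, value_range):
    """Gower score for one numerical feature: 1 - |a - b| / R_k."""
    if value_range <= 0:
        raise ValueError("the range R_k must be positive")
    return 1.0 - abs(a - b) / value_range


# Hypothetical sample of ages; the sample range is max - min.
ages = [18, 30, 35, 52, 70]
sample_range = max(ages) - min(ages)

# Assumed population bounds of 0..100 for the same feature.
population_range = 100 - 0

s_sample = numeric_score(30, 35, sample_range)
s_population = numeric_score(30, 35, population_range)
print(s_sample, s_population)
```

The same pair of values scores as less similar under the narrower sample range, which is why the choice of range matters in the experiments that follow.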
Categorical variables
For categorical variables we will use the following score:
\[
s_{ijk} = 1\{x_{ik}=x_{jk}\}
\]
This score will be \(1\) if the categories are the same and \(0\) if they are not.
The overall score \(S_{\text{Gower}}(x_i, x_j)\) is a /similarity score/, taking values between \(1\) (for equal points) and \(0\) (for extremely dissimilar points). To turn this value into a /distance metric/ we can convert it using, for instance,
\[
d_{\text{Gower}}(x_i, x_j) = \sqrt{1 - S_{\text{Gower}}(x_i, x_j)}.
\]
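The code in the following sections calls two helpers, `score` and `distance`, which are not defined in this text. A minimal sketch of what they might look like, assuming points are (number, category) pairs, that the similarity is the plain mean of the two feature scores, and that the conversion to a distance is \(\sqrt{1-S}\) (one possible choice; \(1-S\) would also serve as a dissimilarity):

```python
import math


def score(x, y, value_range):
    """Mean Gower similarity of two (number, category) points (assumed form)."""
    numeric = 1.0 - abs(x[0] - y[0]) / value_range   # numerical feature
    categorical = 1.0 if x[1] == y[1] else 0.0       # categorical feature
    return (numeric + categorical) / 2.0


def distance(s):
    """Turn a similarity in [0, 1] into a distance (assumed: sqrt(1 - s))."""
    # The max() guards against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - s))


print(distance(score((30, 'M'), (35, 'F'), 85)))
```

The range passed to `score` must be at least as large as the difference between the two numerical values, otherwise the numerical score becomes negative.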
Let’s visualise how the choice of range affects the scoring if we can set it arbitrarily. First let’s pick two arbitrary points, \(x_1=(30, M)\) and \(x_2=(35, F)\).
We will vary the bounds over \(15\leq x_{min}<30\) and \(35< x_{max} \leq 100\).
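```python
import numpy as np

x1 = (30, 'M')
x2 = (35, 'F')

# Candidate lower and upper bounds for the numerical feature
bmin = np.linspace(15, 30, num=1000)
bmax = np.linspace(36, 100, num=1000)

# Vary the lower bound with the upper bound fixed at 100
scores_min = [distance(score(x1, x2, 100 - bm)) for bm in bmin]
# Vary the upper bound with the lower bound fixed at 15
scores_max = [distance(score(x1, x2, bm - 15)) for bm in bmax]
```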
Let’s try with points that are further apart:
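```python
x1 = (16, 'M')
x2 = (90, 'F')

bmin = np.linspace(15, 16, num=1000, endpoint=False)
bmax = np.linspace(91, 100, num=1000)

# Vary the lower bound with the upper bound fixed at 100
scores_min = [distance(score(x1, x2, 100 - bm)) for bm in bmin]
# Vary the upper bound; note the lower bound used here is 16, not 15
scores_max = [distance(score(x1, x2, bm - 16)) for bm in bmax]
```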
Varying range directly
We will now see how the distance changes for two comparisons (one between very close points, one between very distant points) when we vary the range directly. We will choose two pairs of points, \(x_1=(1000, M), x_2=(1001, F)\) and \(x_1=(500, M), x_2=(50000, F)\). The range will vary between
\[
\max(x_1, x_2)-\min(x_1, x_2)<R<100000.
\]
We are also interested in the weight the categorical variable has on the final distance as the bounds vary, so we will also calculate the distances for the alternatives \(x_2'=(1001, M)\) and \(x_2'=(50000, M)\).
For the first set of points we will have:
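```python
x1 = (1000.0, 'M')
x2 = (1001.0, 'F')
MAX_RANGE = 100000

# Ranges from the observed difference up to MAX_RANGE
R = np.linspace(max(x1[0], x2[0]) - min(x1[0], x2[0]), MAX_RANGE, num=100000)

# Distances with differing categories (x2 as given)
distances_M = [distance(score(x1, x2, i)) for i in R]
# Distances with matching categories (x2' has category 'M')
distances_F = [distance(score(x1, (x2[0], 'M'), i)) for i in R]
```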
And for far away points we will have:
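```python
x1 = (500.0, 'M')
x2 = (50000.0, 'F')
MAX_RANGE = 100000

# Ranges from the observed difference up to MAX_RANGE
R = np.linspace(max(x1[0], x2[0]) - min(x1[0], x2[0]), MAX_RANGE, num=100000)

# Distances with differing categories (x2 as given)
distances_M = [distance(score(x1, x2, i)) for i in R]
# Distances with matching categories (x2' has category 'M')
distances_F = [distance(score(x1, (x2[0], 'M'), i)) for i in R]
```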
Categorical impact
Predictably, when the final score is the mean of the per-feature Gower scores, for a point \(x\) with \(p\) features, \(x=(x_{1},\dots,x_{p})\), the contribution of a categorical variable to that mean will be either \(0\) or \(1/p\), regardless of the range.
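This can be checked numerically. The sketch below assumes two-feature points (so \(p=2\)) and a hypothetical helper `mean_gower_score`; it compares the mean score with matching and non-matching categories at several ranges.

```python
def mean_gower_score(x, y, value_range):
    """Mean Gower similarity for (number, category) points, so p = 2."""
    numeric = 1.0 - abs(x[0] - y[0]) / value_range
    categorical = 1.0 if x[1] == y[1] else 0.0
    return (numeric + categorical) / 2.0


x1 = (500.0, 'M')
same_category = (50000.0, 'M')
other_category = (50000.0, 'F')

# Gap between matching and non-matching categories at several ranges
gaps = [
    mean_gower_score(x1, same_category, r) - mean_gower_score(x1, other_category, r)
    for r in (49500.0, 60000.0, 100000.0)
]
print(gaps)
```

Up to floating-point rounding, every gap equals \(1/p = 0.5\): the range only moves the numerical part of the score.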
Missing range
For the previous examples the range \(R\) was available, but how can we calculate the mixed distance when the numerical range is absent?
One possible way is to scale each feature using unit scaling:
\[
f_u(x) = \frac{x}{||x||}
\]
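As a rough illustration of the formula above, the sketch below assumes \(||x||\) is the Euclidean norm of a feature's column of values (the text does not say which norm is meant). The helper `unit_scale` and the data are hypothetical.

```python
import math


def unit_scale(values):
    """Divide every value by the Euclidean norm of the whole vector."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        raise ValueError("cannot unit-scale an all-zero vector")
    return [v / norm for v in values]


# Hypothetical feature column with no known range
feature = [3.0, 4.0, 12.0]
scaled = unit_scale(feature)

# Difference between the first two observations on the scaled feature
difference = abs(scaled[0] - scaled[1])
print(scaled, difference)
```

After scaling, every value lies in \([-1, 1]\), so differences between two values are bounded without knowing the original range.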
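We will visualise how a difference \(\delta\) is mapped by the transformation. The code below samples \(-10 \leq \delta \leq 10\) and passes \(|\delta|\) through a logistic (inverse-logit) function, which squashes it into \([0.5, 1)\).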
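```python
def f_unit(x):
    # Logistic function, written in terms of exp(x)
    return np.exp(x) / (np.exp(x) + 1.0)


def ilogit(eta):
    # Inverse logit; algebraically the same function as f_unit
    return 1.0 - 1.0 / (np.exp(eta) + 1)


delta = np.linspace(-10, 10, num=20000)
transformed = [ilogit(abs(x)) for x in delta]
```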
Final plot
[Figure: Unit scaling transformation]
References
Gower, John C. 1971. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics, 857–71.