Data processing

Normalisation

How do we normalise a variable? Let’s assume a random variable drawn from a uniform distribution,

\[ x \sim \mathcal{U}\left(0,1000\right) \]
import numpy as np

np.random.seed(1234) # for reproducibility

x = np.random.rand(1000) * 1000
import matplotlib.pyplot as plt

from plotutils import *

plt.subplot(1, 2, 1)
plt.scatter(range(1000), x, c=colours[0], edgecolor=edges[0])
plt.subplot(1, 2, 2)
plt.hist(x, bins=25, facecolor=colours[0], edgecolor=edges[0])

plt.show()
[Figure: scatter plot of x against sample index (left) and 25-bin histogram of x (right)]

Normalise with numpy

We can normalise it with numpy in several ways.

For instance, we can compute the vector’s Euclidean (L2) norm and divide the vector by it, which yields a vector of unit length.

normalised = x / np.linalg.norm(x)
plt.subplot(1, 2, 1)
plt.scatter(range(1000), normalised, c=colours[1], edgecolor=edges[1])
plt.subplot(1, 2, 2)
plt.hist(normalised, bins=25, facecolor=colours[1], edgecolor=edges[1])

plt.show()
[Figure: scatter plot (left) and 25-bin histogram (right) of x after dividing by its norm]
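
As a quick sanity check, dividing by the Euclidean norm should produce a vector of unit length. The sketch below is self-contained and draws its own sample; the generator `rng` and its seed are assumptions of this example, not part of the cells above.

```python
import numpy as np

rng = np.random.default_rng(1234)  # assumption: any seed would do
x = rng.random(1000) * 1000        # uniform sample on [0, 1000)

# Euclidean (L2) norm: square root of the sum of squared entries
norm = np.sqrt(np.sum(x ** 2))
normalised = x / norm

# The manual norm agrees with numpy's default norm for 1-D arrays
print(np.isclose(norm, np.linalg.norm(x)))
# The normalised vector has length 1
print(np.isclose(np.linalg.norm(normalised), 1.0))
# Individual entries are small; they are not spread over [0, 1]
print(normalised.min(), normalised.max())
```

Note that this kind of normalisation fixes the length of the whole vector, so the size of each entry depends on how many samples there are.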

Normalise with sklearn

Normaliser

Alternatively, we could use sklearn’s normalize function:

from sklearn.preprocessing import normalize

normalised = normalize(x[:, np.newaxis], axis=0).ravel()

The results are virtually the same; the mean difference between the two approaches is zero:

np.mean((x / np.linalg.norm(x)) - normalize(x[:, np.newaxis], axis=0).ravel())
0.0
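
A mean difference of zero could in principle hide errors that cancel each other out, so an element-wise comparison is a stricter check. The following is a minimal sketch using its own sample; `rng`, `manual`, `with_sklearn`, and `max_abs_diff` are names chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(1234)  # assumption: any seed would do
x = rng.random(1000) * 1000

manual = x / np.linalg.norm(x)
# normalize expects a 2-D array; axis=0 normalises each column (feature)
with_sklearn = normalize(x[:, np.newaxis], axis=0).ravel()

# Largest element-wise discrepancy between the two approaches
max_abs_diff = np.max(np.abs(manual - with_sklearn))
same = np.allclose(manual, with_sklearn)
print(max_abs_diff, same)
```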

MinMaxScaler

Yet another option is to use sklearn’s MinMaxScaler, which by default rescales each feature linearly to the range [0, 1].

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

normalised = scaler.fit_transform(x.reshape(-1, 1))
plt.subplot(1, 2, 1)
plt.scatter(range(1000), normalised, c=colours[1], edgecolor=edges[1])
plt.subplot(1, 2, 2)
plt.hist(normalised, bins=25, facecolor=colours[1], edgecolor=edges[1])

plt.show()
[Figure: scatter plot (left) and 25-bin histogram (right) of x after MinMaxScaler]
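
With its default settings, MinMaxScaler maps the smallest value to 0 and the largest to 1. The sketch below reproduces that by hand and compares the two results; it uses its own sample, and the names `rng`, `manual`, and `scaled` are assumptions of this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1234)  # assumption: any seed would do
x = rng.random(1000) * 1000

# Manual min-max rescaling to the default [0, 1] range
manual = (x - x.min()) / (x.max() - x.min())

# MinMaxScaler works on 2-D input, so reshape to a single column and back
scaled = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(manual.min(), manual.max())
print(np.allclose(manual, scaled))
```

Unlike dividing by the norm, this keeps the values spread over the whole unit interval regardless of the sample size.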

Scaling

In some situations, you might want to standardise the data, that is, rescale it to zero mean and unit standard deviation.

For \(n\)-dimensional datasets, scaling is applied independently to each dimension. For each dimension, the new value \(z\) is given by

\[ z = \frac{x-\mu}{\sigma}, \]

where the mean \(\mu\) is calculated with

\[ \mu = \frac{1}{N}\sum_{i=1}^N\left(x_i\right), \]

and the standard deviation with

\[ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N\left(x_i-\mu\right)^2} \]
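
The three formulas above can be sketched directly in numpy for a single dimension. This is a minimal illustration on its own sample; `rng`, `mu`, `sigma`, and `z` are names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(1234)  # assumption: any seed would do
x = rng.normal(100, 40, 1000)      # mean 100, standard deviation 40

N = x.size
mu = np.sum(x) / N                          # mean
sigma = np.sqrt(np.sum((x - mu) ** 2) / N)  # standard deviation, dividing by N
z = (x - mu) / sigma                        # standardised values

# np.std divides by N by default (ddof=0), matching the formula above
print(np.isclose(sigma, np.std(x)))
# After the transformation the mean is ~0 and the standard deviation ~1
print(np.isclose(z.mean(), 0.0), np.isclose(z.std(), 1.0))
```

The formula divides by \(N\) rather than \(N-1\), which corresponds to `ddof=0` in numpy.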

Let’s start with an example of a two-dimensional dataset, where each sample is a vector in \(\mathbb{R}^2\) with

\[\begin{split} \mathbf{x} = \left(x_1,x_2\right),\\ x_1 \sim \mathcal{U}\left(0, 1000\right),\\ x_2 \sim \mathcal{N}\left(100, 40^2\right). \end{split}\]

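The StandardScaler in sklearn standardises each column to zero mean and unit standard deviation, that is \(\mu=0,\sigma=1\). It only shifts and rescales the values, so the shape of each distribution is unchanged: a Gaussian column becomes approximately \(\mathcal{N}\left(0,1\right)\), while a uniform column stays uniform.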

import numpy as np

N = 1000
x = np.empty((N, 2))  # one row per sample, one column per dimension
x[:,0] = np.random.rand(N) * 1000  # x1 ~ U(0, 1000)
x[:,1] = np.random.normal(100, 40, N)  # x2 ~ N(100, 40^2); 40 is the standard deviation
x
array([[401.10640866, 116.40677803],
       [930.61440167, 128.31437756],
       [515.33614695,  55.9681208 ],
       ...,
       [865.67164613,  72.68390215],
       [954.21650295,  63.18354212],
       [879.56475817, 137.55937995]])

If we calculate the mean and standard deviation for each column individually, we get

print(f"μ(x1)={np.mean(x[:,0])}, σ(x1)={np.std(x[:,0])}")
print(f"μ(x2)={np.mean(x[:,1])}, σ(x2)={np.std(x[:,1])}")
μ(x1)=482.3437960527782, σ(x1)=295.79104190514
μ(x2)=103.09458701764348, σ(x2)=39.278136656633386

We can also inspect the data visually.

import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)
plt.scatter(x[:,0], x[:,1], c="lightgray", edgecolor="gray")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.subplot(1, 2, 2)
plt.hist(x[:,0], bins=25, label="$x_1$", facecolor=colours[0], alpha=0.5)
plt.hist(x[:,1], label="$x_2$", facecolor=colours[1], alpha=0.5)
plt.xlabel("value")
plt.ylabel("frequency")
plt.legend(loc="upper right")
plt.show()
[Figure: scatter plot of x_1 against x_2 (left) and overlaid histograms of x_1 and x_2 (right)]
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled = scaler.fit_transform(x)

Let’s verify the mean and standard deviation after the transformation.

print(f"μ(scaled1)={np.mean(scaled[:,0])}, σ(scaled1)={np.std(scaled[:,0])}")
print(f"μ(scaled2)={np.mean(scaled[:,1])}, σ(scaled2)={np.std(scaled[:,1])}")
μ(scaled1)=-5.346834086594754e-16, σ(scaled1)=0.9999999999999994
μ(scaled2)=-3.488764832582092e-15, σ(scaled2)=0.9999999999999999
import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)
plt.scatter(scaled[:,0], scaled[:,1], c="lightgray", edgecolor="gray")
plt.xlabel("scaled $x_1$")
plt.ylabel("scaled $x_2$")
plt.subplot(1, 2, 2)
plt.hist(scaled[:,0], bins=25, label="scaled $x_1$", facecolor=colours[0], alpha=0.5)
plt.hist(scaled[:,1], label="scaled $x_2$", facecolor=colours[1], alpha=0.5)
plt.xlabel("value")
plt.ylabel("frequency")
plt.legend(loc="upper right")
plt.show()
[Figure: scatter plot of scaled x_1 against scaled x_2 (left) and overlaid histograms of the scaled columns (right)]
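
To see what the scaler actually learned, we can look at its fitted `mean_` and `scale_` attributes and compare the output with a z-score computed by hand. The sketch below builds its own two-column sample; `rng` and `manual` are names chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1234)  # assumption: any seed would do
N = 1000
# Column 0 is uniform on [0, 1000), column 1 is Gaussian (mean 100, std 40)
x = np.column_stack([rng.random(N) * 1000, rng.normal(100, 40, N)])

scaler = StandardScaler()
scaled = scaler.fit_transform(x)

# The fitted parameters are the per-column mean and standard deviation
print(np.allclose(scaler.mean_, x.mean(axis=0)))
print(np.allclose(scaler.scale_, x.std(axis=0)))

# The transform is the per-column z-score
manual = (x - x.mean(axis=0)) / x.std(axis=0)
print(np.allclose(scaled, manual))

# The uniform column stays bounded after scaling; it does not gain Gaussian tails
print(np.abs(scaled[:, 0]).max())
```

A uniform variable has a standard deviation of its range divided by \(\sqrt{12}\), so its standardised values should stay within roughly \(\pm\sqrt{3}\), which is one way to see that the scaler does not change the shape of the distribution.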