# Data processing

## Normalisation

How do we normalise a variable? As a running example, let’s take 1000 samples of a uniform random variable distributed as

$x \sim \mathcal{U}\left(0,1000\right)$

import numpy as np

np.random.seed(1234) # for reproducibility

x = np.random.rand(1000) * 1000

import matplotlib.pyplot as plt

from plotutils import *  # local helper module; assumed to provide the `colours` and `edges` lists used below

plt.subplot(1, 2, 1)
plt.scatter(range(1000), x, c=colours[0], edgecolor=edges[0])
plt.subplot(1, 2, 2)
plt.hist(x, bins=25, facecolor=colours[0], edgecolor=edges[0])

plt.show()


### Normalise with `numpy`

We can normalise it using `numpy` in several ways.

For instance, we can compute the vector’s Euclidean (L2) norm and divide the vector by it, so that the result has unit length.

normalised = x / np.linalg.norm(x)

plt.subplot(1, 2, 1)
plt.scatter(range(1000), normalised, c=colours[1], edgecolor=edges[1])
plt.subplot(1, 2, 2)
plt.hist(normalised, bins=25, facecolor=colours[1], edgecolor=edges[1])

plt.show()
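As a quick check of what this kind of normalisation does, the sketch below divides a vector by its L2 norm and inspects the result. The names `rng`, `x_demo` and `unit`, as well as the seed, are illustrative assumptions and not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(1234)  # illustrative seed, independent of the one used above
x_demo = rng.uniform(0, 1000, size=1000)

# Divide by the Euclidean (L2) norm, the default of np.linalg.norm for 1-D arrays
unit = x_demo / np.linalg.norm(x_demo)

print(np.linalg.norm(unit))    # length of the rescaled vector, 1 up to rounding
print(unit.min(), unit.max())  # the values are small and do not span [0, 1]
```

Dividing by the norm fixes the length of the vector at 1, so the individual values shrink as the number of samples grows instead of being mapped onto a fixed interval.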


### Normalise with `sklearn`

#### Normaliser

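Alternatively, we could use `sklearn`’s `normalize` function. With `axis=0` it normalises each column, so the vector is first reshaped into a single column: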

from sklearn.preprocessing import normalize

normalised = normalize(x[:, np.newaxis], axis=0).ravel()


Both approaches give virtually the same result; the mean difference between them is zero:

np.mean((x / np.linalg.norm(x)) - normalize(x[:, np.newaxis], axis=0).ravel())

0.0
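A mean difference of zero could in principle hide positive and negative deviations that cancel out, so an element-wise comparison is a stricter check. The following is a minimal sketch of such a check; `x_demo`, `by_numpy`, `by_sklearn` and the seed are illustrative names and values chosen here.

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)  # illustrative seed
x_demo = rng.uniform(0, 1000, size=1000)

by_numpy = x_demo / np.linalg.norm(x_demo)
by_sklearn = normalize(x_demo[:, np.newaxis], norm="l2", axis=0).ravel()

# Compare element by element instead of averaging the differences
same = np.allclose(by_numpy, by_sklearn)
largest_gap = np.max(np.abs(by_numpy - by_sklearn))
print(same, largest_gap)
```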


#### `MinMaxScaler`

Yet another option is to use `sklearn`’s `MinMaxScaler`, which by default linearly rescales each feature to the range [0, 1].

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

normalised = scaler.fit_transform(x.reshape(-1, 1))

plt.subplot(1, 2, 1)
plt.scatter(range(1000), normalised, c=colours[1], edgecolor=edges[1])
plt.subplot(1, 2, 2)
plt.hist(normalised, bins=25, facecolor=colours[1], edgecolor=edges[1])

plt.show()
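To make the min-max rescaling explicit, the sketch below writes it out by hand as (x - min) / (max - min) and compares it with the scaler's output. The variable names and the seed are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)  # illustrative seed
x_demo = rng.uniform(0, 1000, size=1000)

# Min-max rescaling written out by hand: (x - min) / (max - min)
manual = (x_demo - x_demo.min()) / (x_demo.max() - x_demo.min())

# The same operation with the scaler (default feature_range=(0, 1))
scaled_demo = MinMaxScaler().fit_transform(x_demo.reshape(-1, 1)).ravel()

print(manual.min(), manual.max())        # smallest and largest rescaled value
print(np.allclose(manual, scaled_demo))  # do both routes agree?
```

Unlike division by the norm, this maps the smallest sample to 0 and the largest to 1, whatever the number of samples.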


## Scaling

In some situations, you might want to standardise the data, that is, rescale each feature so that it has zero mean and unit standard deviation.

For $$n$$-dimensional datasets, scaling is applied independently to each dimension. For each dimension, the new value $$z$$ is given by

$z = \frac{x-\mu}{\sigma},$

where the mean $$\mu$$ over the $$N$$ samples is calculated with

$\mu = \frac{1}{N}\sum_{i=1}^N x_i,$

and the standard deviation $$\sigma$$ with

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N\left(x_i-\mu\right)^2}.$
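As a small worked instance of these formulas, the sketch below standardises eight toy values in plain `numpy`. The data and variable names are made up for illustration; the numbers are chosen so that the arithmetic can be followed by hand.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy data
n = len(values)

mu = values.sum() / n                             # mean: 40 / 8 = 5.0
sigma = np.sqrt(((values - mu) ** 2).sum() / n)   # sqrt(32 / 8) = 2.0
z = (values - mu) / sigma                         # -1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2

print(mu, sigma)  # 5.0 2.0
print(z)
```

Note that `np.std` uses the same 1/N definition by default (`ddof=0`), so `values.std()` returns the same $$\sigma$$.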

Let’s work through an example with a two-dimensional dataset whose columns are distributed as

$\begin{split} \mathbf{x} &= \left(x_1,x_2\right),\\ x_1 &\sim \mathcal{U}\left(0, 1000\right),\\ x_2 &\sim \mathcal{N}\left(100, 40^2\right). \end{split}$

The `StandardScaler` in `sklearn` shifts and rescales each column so that $$\mu=0,\sigma=1$$. It does not change the shape of the distribution: $$x_1$$ remains uniform, while $$x_2$$ remains Gaussian and becomes approximately $$\mathcal{N}\left(0,1\right)$$.

import numpy as np

N = 1000
x = np.empty((N, 2))  # one row per sample, one column per variable
x[:,0] = np.random.rand(N) * 1000  # x1 ~ U(0, 1000)
x[:,1] = np.random.normal(100, 40, N)  # x2 ~ Gaussian with mean 100 and standard deviation 40
x

array([[401.10640866, 116.40677803],
[930.61440167, 128.31437756],
[515.33614695,  55.9681208 ],
...,
[865.67164613,  72.68390215],
[954.21650295,  63.18354212],
[879.56475817, 137.55937995]])


If we calculate the mean and standard deviation for each column individually, we get

print(f"μ(x1)={np.mean(x[:,0])}, σ(x1)={np.std(x[:,0])}")
print(f"μ(x2)={np.mean(x[:,1])}, σ(x2)={np.std(x[:,1])}")

μ(x1)=482.3437960527782, σ(x1)=295.79104190514
μ(x2)=103.09458701764348, σ(x2)=39.278136656633386


We can also inspect the data visually.

import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)
plt.scatter(x[:,0], x[:,1], c="lightgray", edgecolor="gray")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.subplot(1, 2, 2)
plt.hist(x[:,0], bins=25, label="$x_1$", facecolor=colours[0], alpha=0.5)
plt.hist(x[:,1], label="$x_2$", facecolor=colours[1], alpha=0.5)
plt.xlabel("value")
plt.ylabel("frequency")
plt.legend(loc="upper right")
plt.show()

We now fit a `StandardScaler` to the data and transform it.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled = scaler.fit_transform(x)


Let’s verify the mean and standard deviation after the transformation. Up to floating-point error, they should be 0 and 1 respectively.

print(f"μ(scaled1)={np.mean(scaled[:,0])}, σ(scaled1)={np.std(scaled[:,0])}")
print(f"μ(scaled2)={np.mean(scaled[:,1])}, σ(scaled2)={np.std(scaled[:,1])}")

μ(scaled1)=-5.346834086594754e-16, σ(scaled1)=0.9999999999999994
μ(scaled2)=-3.488764832582092e-15, σ(scaled2)=0.9999999999999999
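It can also be useful to look at what the scaler stores after fitting. The sketch below, on freshly generated data with illustrative names (`data`, `demo_scaler`) and an arbitrary seed, reads the fitted `mean_` and `scale_` attributes, reproduces the transformation by hand, and undoes it with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # illustrative seed
data = np.column_stack([
    rng.uniform(0, 1000, size=1000),  # uniform column
    rng.normal(100, 40, size=1000),   # Gaussian column (mean 100, std 40)
])

demo_scaler = StandardScaler().fit(data)

# Statistics learned during fit, one entry per column
print(demo_scaler.mean_)   # column means
print(demo_scaler.scale_)  # column standard deviations (1/N definition)

# transform() applies (x - mean_) / scale_ column by column
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(demo_scaler.transform(data), manual))

# inverse_transform() maps the scaled values back to the original units
restored = demo_scaler.inverse_transform(demo_scaler.transform(data))
print(np.allclose(restored, data))
```

Keeping the fitted scaler around matters in practice: the same `mean_` and `scale_` can then be applied to new data with `transform`, rather than re-estimating them.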

import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)
plt.scatter(scaled[:,0], scaled[:,1], c="lightgray", edgecolor="gray")
plt.xlabel("scaled $x_1$")
plt.ylabel("scaled $x_2$")
plt.subplot(1, 2, 2)
plt.hist(scaled[:,0], bins=25, label="scaled $x_1$", facecolor=colours[0], alpha=0.5)
plt.hist(scaled[:,1], label="scaled $x_2$", facecolor=colours[1], alpha=0.5)
plt.xlabel("value")
plt.ylabel("frequency")
plt.legend(loc="upper right")
plt.show()
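One way to see how standardisation affects the shape of each distribution is to compute a shape statistic after scaling. The sketch below standardises a uniform and a Gaussian column by hand and computes the excess kurtosis of each, which is 0 for a Gaussian and -1.2 for a uniform distribution. The data, names and seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
data = np.column_stack([
    rng.uniform(0, 1000, size=1000),  # uniform column
    rng.normal(100, 40, size=1000),   # Gaussian column (mean 100, std 40)
])

# Standardise each column by hand: zero mean, unit standard deviation
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Excess kurtosis per column: near -1.2 for uniform data, near 0 for Gaussian data
excess_kurtosis = np.mean(z ** 4, axis=0) - 3.0
print(excess_kurtosis)
```

The first column keeps the flat-topped shape of a uniform distribution after scaling; only its location and spread change.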