K-means clustering


K-means is still one of the fundamental clustering algorithms. It is used in such diverse fields as Natural Language Processing (NLP), social sciences and medical sciences.

The core idea behind K-means is that we want to group data in clusters. Data points will be assigned to a specific cluster depending on it’s distance to a cluster’s center, usually called the centroid.

It is important to note that typically, the mean distance to a centroid is used to partition the clusters, however, difference distances can be used and different pivot points. An example is the K-medoids clustering algorithm.

We will define the two main steps of a generic K-means clustering algorithm, namely the data assignement and the centroid update step.

Data assignement

The criteria to determine whether a point is closer to one centroid is typically an Euclidean distance (\(L^2\)) . If we consider a set of \(n\) centroids \(C\), such that

\[ C = \lbrace c_1, c_2, \dots, c_n \rbrace \]

We assign each data point in \(\mathcal{D}=\lbrace x_1, x_2, \dots, x_n \rbrace\) to the nearest centroid according to its distance, such that

\[ \underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2 \]

As mentioned previously \(dist(\cdot)\) is typically the standard (\(L^2\)) Euclidean distance. We define the subset of points assigned to a centroid \(i\) as \(S_i\).

Centroid update step

This step corresponds to updating the centroids using the mean of add points assign to a cluster, \(S_i\). That is

\[ c_i=\frac{1}{|S_i|}\sum_{x_i \in S_i} x_i \]


Different algorithms can be used for cluster partitioning, for instance:

  • PAM


To illustrate the PAM partitioning method, we will use a synthetic dataset created along the guidelines in synthetic data generation.

In order to use the “Elbow method” we calculate the Within-Cluster Sum of Squares (WCSS) for a varying number of clusters, \(K\).

dataset = pd.read_csv('data/mall-customers.zip')
X = dataset.iloc[:, [3, 4]].values
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
kmeans = KMeans(n_clusters = 5, init = "k-means++", random_state = 42)
y_kmeans = kmeans.fit_predict(X)