RPZ Assignment: K-means
Detailed BRUTE upload instructions at the end of this notebook.
Introduction
The K-means algorithm is an unsupervised clustering method. Unlike the previous supervised methods, which operated with a set of observations and labels $\mathcal{T} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, unsupervised methods utilize unlabeled observations $\mathcal{T} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. Their objective is to infer properties of a probability distribution $p_X(\mathbf{x})$.
In low-dimensional spaces ($\text{dim}(\mathbf{x}) \leq 3$), direct estimation of the probability density is feasible using various techniques (e.g. histograms, Parzen windows). Additionally, if prior information about a specific probability distribution type is available, direct estimation of the distribution parameters becomes possible.
However, these methods encounter challenges in higher dimensions due to the curse of dimensionality. As dimensionality increases, the volume grows exponentially, leading to increasingly sparse data. In such cases, techniques that describe the data in a more generalized manner, rather than identifying the complete underlying distribution $p_X(\mathbf{x})$, are preferred.
The main groups of unsupervised methods include:
- dimensionality reduction methods
- clustering methods
Dimensionality reduction methods aim to map data onto a lower-dimensional manifold while minimizing information loss. Clustering methods involve partitioning observations into groups with minimal differences between the samples within each group.
Unsupervised methods find application in various domains, such as smart feature selection, data visualization, efficient data encoding, adaptive classification, and image segmentation.
K-means
The K-means algorithm is a widely used clustering approach. Its primary objective is to determine a clustering $\{\mathcal{T}_k\}^K_{k=1}$ of the observations that minimizes the within-cluster sum of squares (WCSS): $$\{\mathbf{c}_1, \ldots, \mathbf{c}_K; \mathcal{T}_1, \ldots, \mathcal{T}_K\} = \underset{\text{all }\mathbf{c}^{'}_k, \mathcal{T}^{'}_k} {\operatorname{arg\,min}} \sum_{k=1}^{K} \sum_{\mathbf{x} \in \mathcal{T}^{'}_k} \| \mathbf{x} - \mathbf{c}^{'}_k \|^2_2,$$ where $K$ is a user-specified number of clusters, $\mathcal{T}_1,\ldots,\mathcal{T}_K$ are sets of observations belonging to clusters $1,\ldots,K$ and each cluster is represented by its prototype $\mathbf{c}_k$.
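To make the objective concrete, the WCSS of a given clustering can be evaluated directly. The following is a minimal sketch, assuming observations are stored as columns of x (the convention used throughout this notebook) and cluster labels are indices in {0, ..., K-1}; the function name is ours and is not part of the assignment template.
import numpy as np

def wcss(x, cluster_labels, centroids):
    # x: (dim, n) observations as columns, centroids: (dim, K) prototypes as columns,
    # cluster_labels: (n,) index of the assigned cluster for each observation
    diffs = x - centroids[:, cluster_labels]  # difference to the assigned prototype
    return np.sum(diffs ** 2)                 # within-cluster sum of squares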
K-means algorithm
Input: set of observations $\mathcal{T} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, number of clusters $K$, maximum number of iterations.
Output: set of cluster prototypes $\{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$.
Initialize $\{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$ with $K$ randomly chosen, distinct input observations $\mathbf{x}_i$.
Do:
- Classify observations $\{\mathbf{x}_l\}^n_{l=1}$ to the cluster represented by the nearest prototype $\{\mathbf{c}_k\}^K_{k=1}$: $$y_l = \arg\min_{k=1,\ldots,K}\|\mathbf{x}_l - \mathbf{c}_k\|^2_2 \qquad (\forall l = 1,\ldots,n),$$ $$ \mathcal{T}_k = \{\mathbf{x}_l \in \mathcal{T}: y_l = k\} \qquad (\forall k = 1,\ldots,K) $$
- Update the prototypes $\mathbf{c}_k$ to the mean value of all observations $\mathbf{x}_l$ belonging to the $k$-th cluster: $$\mathbf{c}_k = \frac{1}{|\mathcal{T}_k|} \sum_{\mathbf{x} \in \mathcal{T}_k} \mathbf{x} \qquad (\forall k = 1,\ldots,K)$$
Until no change in clustering $\mathcal{T}_k$ or maximum number of iterations is exceeded.
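The loop above can be sketched in a few lines of NumPy. This is only an illustration of the two alternating steps under the column-wise data convention used below; your graded solution should go into the k_means template function, whose exact interface is described in the template itself.
import numpy as np

def k_means_sketch(x, k, max_iter=100):
    # x: (dim, n) array with observations stored as columns
    n = x.shape[1]
    # initialization: k distinct observations picked at random become the prototypes
    centroids = x[:, np.random.choice(n, k, replace=False)].astype(float)
    cluster_labels = None
    it = 0
    while it < max_iter:  # max_iter may be np.inf
        it += 1
        # assignment step: each observation joins the nearest prototype (squared L2 distance)
        dists = np.sum((x[:, :, None] - centroids[:, None, :]) ** 2, axis=0)  # shape (n, k)
        new_labels = np.argmin(dists, axis=1)
        if cluster_labels is not None and np.array_equal(new_labels, cluster_labels):
            break  # clustering did not change -> converged
        cluster_labels = new_labels
        # update step: each prototype moves to the mean of its current cluster members
        for j in range(k):
            if np.any(cluster_labels == j):
                centroids[:, j] = x[:, cluster_labels == j].mean(axis=1)
    # squared distance of each observation to its assigned prototype
    sq_dists = np.sum((x - centroids[:, cluster_labels]) ** 2, axis=0)
    return cluster_labels, centroids, sq_dists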
%load_ext autoreload
%autoreload 2
# uncomment following for interactive matplotlib
# %matplotlib notebook
from kmeans import *
import scipy.io
import numpy as np
import matplotlib.pyplot as plt
from numpy import testing
Part 1: Implement K-means
Your first task is to implement the basic K-means algorithm based on the description provided above. Refer to the template function k_means for input details.
Test the algorithm on simple 1D data:
# In order to obtain the same results, use this initialization of the random number generator
# (DO NOT put this into functions submitted to the upload system).
np.random.seed(0)
x = np.array([[16, 12, 50, 96, 34, 59, 22, 75, 26, 51]])
cluster_labels, centroids, sq_dists = k_means(x, k=3, max_iter=np.inf)
np.testing.assert_array_equal(cluster_labels, [1, 1, 0, 0, 2, 0, 1, 0, 1, 0])
np.testing.assert_array_equal(centroids, [[66, 19, 34]])
np.testing.assert_array_equal(sq_dists, [9.0, 49.0, 256.0, 900.0, 0.0, 49.0, 9.0, 81.0, 49.0, 225.0])
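As the expected values above suggest, sq_dists holds the squared distance of each observation to its assigned centroid (e.g. $(16 - 19)^2 = 9$ for the first sample), so summing it gives the WCSS of the resulting clustering; this is the quantity compared across runs in Part 2.
wcss_value = np.sum(sq_dists)  # within-cluster sum of squares of this clustering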
Test on images for a limited number of iterations.
np.random.seed(0)
images = np.load("data_kmeans.npz", allow_pickle=True)["images"]
x = compute_measurements(images)
_, centroids, _ = k_means(x, k=3, max_iter=4)
np.testing.assert_array_almost_equal(centroids, [[330.84821429, -115.04545455, -4566.76],
[-2348.99107143, 658.20454545, 2375.78]])
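The compute_measurements helper is provided with the assignment template; here it reduces each image to a low-dimensional feature vector (two features per image, judging by the centroid shape above). Purely as a hypothetical illustration of such a mapping, and not the provided implementation, one could use intensity differences between image halves, assuming the images are stacked in an (H, W, N) array:
def toy_measurements(images):
    # images: (H, W, N) array of grayscale images (hypothetical layout, for illustration only)
    h, w, _ = images.shape
    imgs = images.astype(np.int64)
    # feature 1: left-half minus right-half intensity sum; feature 2: top minus bottom
    left_right = imgs[:, :w // 2, :].sum(axis=(0, 1)) - imgs[:, w // 2:, :].sum(axis=(0, 1))
    top_bottom = imgs[:h // 2, :, :].sum(axis=(0, 1)) - imgs[h // 2:, :, :].sum(axis=(0, 1))
    return np.vstack([left_right, top_bottom])  # shape (2, N), observations as columns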
Test and visualise on synthetic 2D data. Try to run the algorithm several times with random initialization. You should encounter several runs in which the clustering converges only to a local optimum.
# np.random.seed(19)
k = 4
x = gen_kmeanspp_data()
print(f'K-means with K = {k}...')
centroids = x[:, np.random.choice(range(x.shape[1]), k, replace=False)]
show_clusters(x, np.ones([1, x.shape[1]]), centroids)
plt.title('K-means random INITIALIZATION')
plt.savefig('random_initialization.png')
cluster_labels, centroids, _ = k_means(x, k, np.inf, False, centroids)
show_clusters(x, cluster_labels, centroids)
plt.title('K-means RESULT (random initialization)');
K-means with K = 4...
Finally, train K-means on the real MNIST image dataset and visualize the clusters both in feature space and as sets of images.
data = np.load("data_numbers_kmeans.npz", allow_pickle=True)
numbers_images = data["images"]
# np.random.seed(0)
k = 10
print(f'K-means, K = {k}...')
x = compute_measurements(numbers_images)
cluster_labels, centroids, _ = k_means(x, k, np.inf)
show_clusters(x, cluster_labels, centroids, title='K-means result for 10 classes of digits', figsize=(6,6))
show_clustered_images(numbers_images, cluster_labels, title='Visualisation of the cluster members');
show_mean_images(numbers_images, cluster_labels, title='Mean images of the cluster members');
K-means, K = 10...
Part 2: Avoiding local minima
The solution that K-means reaches depends on the initialization of the cluster means and there is no guarantee that it reaches the global optimum, as you have seen above. The chance of getting trapped in a local minimum can be reduced by performing more trials of the K-means clustering and choosing the result that minimizes WCSS.
In the following, we will extract measurements from one of the unlabelled datasets:
- data_kmeans.npz, which consists of images of three letters (H, L, T; here K = 3),
- the MNIST dataset data_mnist_kmeans.npz of handwritten digits (K = 10).
data = np.load("data_letters_kmeans.npz", allow_pickle=True)
letters_images = data["images"]
Implement the function k_means_multiple_trials(x, k, n_trials, max_iter), which tries to avoid being trapped in a local minimum by performing K-means several times and selecting the solution that minimizes the WCSS.
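A minimal sketch of such a wrapper, assuming k_means returns the per-sample squared distances as in Part 1 (the show keyword used in the cell below is omitted here), could look as follows; the actual solution belongs in the template function.
def k_means_multiple_trials_sketch(x, k, n_trials, max_iter):
    # run K-means n_trials times and keep the clustering with the smallest WCSS
    best_result, best_wcss = None, np.inf
    for _ in range(n_trials):
        cluster_labels, centroids, sq_dists = k_means(x, k, max_iter)
        trial_wcss = np.sum(sq_dists)  # within-cluster sum of squares of this trial
        if trial_wcss < best_wcss:
            best_wcss = trial_wcss
            best_result = (cluster_labels, centroids, sq_dists)
    return best_result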
# np.random.seed(0)
x = compute_measurements(letters_images)
k = 3
cluster_labels_m, centroids_m, x_dists_m = k_means_multiple_trials(x, k, 10, np.inf, show=True)
show_clusters(x, cluster_labels_m, centroids_m, title='The best pick from multiple trials')
show_clustered_images(letters_images, cluster_labels_m, title='Cluster visualisation for the best pick from multiple trials')