K-means Clustering from Scratch
NumPy is all you need

An algorithm must be seen to be believed — Donald Knuth
Overview
The science of Machine Learning can be broadly classified into two categories: supervised learning and unsupervised learning.
In this blog post, we will implement a popular unsupervised learning algorithm: k-means clustering.
The algorithm uses numerical distance measures to partition data into clusters.
Algorithm
Let’s say we have a bunch of observations and we want to group “similar” observations together. We will use the following algorithm to achieve our goal.
K-means algorithm
Input: k (number of clusters), D (data points)
1. Choose k random data points as the initial cluster means.
2. Associate each data point in D with the nearest centroid. This divides the data into k clusters.
3. Recompute the centroids.
4. Repeat steps 2 and 3 until there are no more changes in the cluster membership of the data points.
Let us look at the above algorithm in a bit more detail.
We first assign each data point to a cluster at random and then compute the mean of each cluster.
After that, we compute the squared Euclidean distance between each point and every cluster mean, and assign each data point to the cluster whose mean is closest to it.
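Concretely, if we write the clusters as $C_1, \dots, C_k$ and the mean of cluster $C_i$ as $\mu_i$, each pass of reassigning points and recomputing means never increases the total squared distance of every point to its assigned cluster mean:

$$
\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
$$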
If one were to ask a statistician, she/he might tell you that we are trying to minimize the within-cluster sum of squares (WCSS). Let’s now try to implement this algorithm in Python.
Implementation
Though there are many library implementations of the k-means algorithm in Python, I decided to use only NumPy in order to provide an instructive approach. NumPy is a popular Python library used for numerical computation.
Code Walkthrough
```python
import numpy as np
import tqdm
import itertools
import matplotlib.pyplot as plt


class Kmeans:
    def __init__(self, k=3):
        self.k = k
        self.means = None
        self._cluster_ids = None
```
We first create a class called Kmeans and pass a single constructor argument k to it. This argument is a hyperparameter. Hyperparameters are parameters that are set by the user before training the machine learning algorithm. In our case, this is the total number of clusters we wish to partition our data into. We also add two more attributes to the constructor: means, which will store the cluster means, and _cluster_ids, which stores the id values of the clusters.
```python
import numpy as np
import tqdm
import itertools
import matplotlib.pyplot as plt


class Kmeans:
    def __init__(self, k=3):
        self.k = k
        self.means = None
        self._cluster_ids = None

    @property
    def cluster_ids(self):
        return self._cluster_ids

    def _init_centroid(self, m):
        return np.random.randint(0, self.k, m)
```
We then create a method called cluster_ids, which acts as a getter for our cluster ids. @property is a function decorator that lets us access cluster_ids as if it were an attribute. Another method, _init_centroid, is created to randomly assign each data point to a cluster.
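As a quick sanity check of what _init_centroid produces (the exact ids below depend on the random state and are only illustrative):

```python
import numpy as np

k, m = 3, 10
# Each of the m data points gets a random cluster id drawn from {0, ..., k - 1}
initial_ids = np.random.randint(0, k, m)
print(initial_ids.shape)                         # (10,)
print(set(np.unique(initial_ids)) <= {0, 1, 2})  # True
```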
```python
import numpy as np
import tqdm
import itertools
import matplotlib.pyplot as plt


class Kmeans:
    def __init__(self, k=3):
        self.k = k
        self.means = None
        self._cluster_ids = None

    @property
    def cluster_ids(self):
        return self._cluster_ids

    def _init_centroid(self, m):
        return np.random.randint(0, self.k, m)

    def _cluster_means(self, X, clusters):
        m, n = X.shape[0], X.shape[1]
        # Extra column to store cluster ids
        temp = np.zeros((m, n + 1))
        temp[:, :n], temp[:, n] = X, clusters

        result = np.zeros((self.k, n))
        for i in range(self.k):
            subset = temp[np.where(temp[:, -1] == i), :n]
            if subset[0].shape[0] > 0:
                result[i] = np.mean(subset[0], axis=0)
            else:
                # Choose a random data point if a cluster does not
                # have any data associated with it
                result[i] = X[np.random.choice(X.shape[0], 1, replace=True)]
        return result

    def _compute_cluster(self, x):
        # Computes the closest cluster mean to a data point x
        return min(range(self.k),
                   key=lambda i: np.linalg.norm(x - self.means[i]) ** 2)
```
_cluster_means computes the means of our clusters. It accepts a NumPy array containing the data and another NumPy array containing the cluster ids as input. We use a temporary array temp to store the features along with the cluster ids, then compute the mean of the data points in each cluster and return the means as an array.
Note that some clusters may not have any data associated with them (because we assign clusters randomly at the start). If a cluster ends up empty, we randomly select an observation to serve as its mean.
_compute_cluster is the method that determines which cluster’s mean is closest to a data point. np.linalg.norm() computes the Euclidean distance, and we square it to get the squared Euclidean distance that appears in the within-cluster sum of squares. Since squaring is monotonic, minimizing the squared distance picks the same cluster as minimizing the distance itself.
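To make the distance computation concrete, here is a small standalone check (the point and candidate means below are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0])
means = np.array([[0.0, 0.0],
                  [1.0, 1.0],
                  [5.0, 5.0]])

# Squared Euclidean distance from x to each candidate mean
sq_dists = [np.linalg.norm(x - m) ** 2 for m in means]      # [5.0, 1.0, 25.0]
closest = min(range(len(means)), key=lambda i: sq_dists[i])
print(closest)                                              # 1 -> mean [1.0, 1.0] is nearest
```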
```python
import numpy as np
import tqdm
import itertools
import matplotlib.pyplot as plt


class Kmeans:
    def __init__(self, k=3):
        self.k = k
        self.means = None
        self._cluster_ids = None

    @property
    def cluster_ids(self):
        return self._cluster_ids

    def _init_centroid(self, m):
        return np.random.randint(0, self.k, m)

    def _cluster_means(self, X, clusters):
        m, n = X.shape[0], X.shape[1]
        # Extra column to store cluster ids
        temp = np.zeros((m, n + 1))
        temp[:, :n], temp[:, n] = X, clusters

        result = np.zeros((self.k, n))
        for i in range(self.k):
            subset = temp[np.where(temp[:, -1] == i), :n]
            if subset[0].shape[0] > 0:
                result[i] = np.mean(subset[0], axis=0)
            else:
                # Choose a random data point if a cluster does not
                # have any data associated with it
                result[i] = X[np.random.choice(X.shape[0], 1, replace=True)]
        return result

    def _compute_cluster(self, x):
        # Computes the closest cluster mean to a data point x
        return min(range(self.k),
                   key=lambda i: np.linalg.norm(x - self.means[i]) ** 2)

    def fit(self, X, num_iterations=None):
        m = X.shape[0]

        # Initialize clusters
        initial_clusters = self._init_centroid(m)
        new_clusters = np.zeros(initial_clusters.shape)

        with tqdm.tqdm(itertools.count()) as t:
            for _ in t:
                # Compute cluster means
                self.means = self._cluster_means(X, initial_clusters)
                for i in range(m):
                    # Assign new cluster ids
                    new_clusters[i] = self._compute_cluster(X[i])

                # Check for data points that have switched cluster ids
                count_changed = (new_clusters != initial_clusters).sum()
                if count_changed == 0:
                    break
                # Copy so that updating new_clusters on the next pass
                # does not silently modify initial_clusters as well
                initial_clusters = new_clusters.copy()
                t.set_description(f"changed: {count_changed} / {X.shape[0]}")

        self._cluster_ids = new_clusters
```
Finally, we create the fit method that orchestrates the clustering process.
Steps in the fit() method:
1. We first initialize each observation to a cluster. We also create an array of zeroes to store the new cluster ids.
2. We then use itertools.count() to create an infinite loop and compute cluster means.
3. We then assign new cluster ids based on the squared distance between the cluster means and each data point.
4. We then check if any data points changed clusters. If they did, we use the new cluster ids to recompute the cluster means.
5. Steps 2 to 4 are repeated until no data points change clusters.
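Putting it all together, here is a minimal usage sketch on synthetic data (the three Gaussian blobs and the variable names are made up for illustration; Kmeans is the class defined above):

```python
import numpy as np
import matplotlib.pyplot as plt

# Three synthetic 2-D blobs
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.5, size=(100, 2)),
])

model = Kmeans(k=3)
model.fit(X)

labels = model.cluster_ids.astype(int)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(model.means[:, 0], model.means[:, 1], marker="x", c="black")
plt.show()
```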
And there you have it, folks! You have successfully created your own k-means clustering class capable of clustering data. Here are some results on a few datasets:
Visualizations
Choosing the value of k
Since k is a hyperparameter, we need some methodology to pick an optimal value of k. One popular approach is the elbow method.
In short, the elbow method plots a curve of the number of clusters versus the percentage of explained variation. Practitioners use this curve to determine the optimal number of clusters by following the law of diminishing returns: if adding an extra cluster does not significantly improve the explained variation, we stick with the current number of clusters.
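Here is a rough sketch of the elbow method using the class above. The wcss helper is something I am adding for illustration; it plots the raw WCSS, which decreases as the explained variation increases, so the “elbow” shows up in the same place:

```python
import numpy as np
import matplotlib.pyplot as plt

def wcss(X, means, cluster_ids):
    # Total squared distance of every point to its assigned cluster mean
    ids = cluster_ids.astype(int)
    return sum(np.linalg.norm(x - means[c]) ** 2 for x, c in zip(X, ids))

ks = range(1, 9)
scores = []
for k in ks:
    model = Kmeans(k=k)
    model.fit(X)  # X: the same data as in the earlier usage sketch
    scores.append(wcss(X, model.means, model.cluster_ids))

plt.plot(list(ks), scores, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.show()
```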
Tips and Optimizations
Here are a few tips to help obtain good clusters:
- **Removing non-numeric features:** Data may have non-numeric (categorical) features represented as numbers. Instead of the numbers having some quantitative value, they might be used as labels for a group. For example, in a population dataset, a column named “Gender” may have values 0 and 1 representing Male and Female. We must be careful to remove these features, as they do not have any quantitative value and will therefore distort our algorithm’s notion of ‘distance’.
- **Feature scaling:** Numeric features will have different ranges, and a feature with a huge range can dominate the clustering objective over the other features. It is therefore crucial to scale our data so that each feature contributes proportionately to the distance computation (see the short sketch after this list).
- **Better initialization:** In our algorithm, we randomly assign the initial clusters to the data. Because of this inherent randomness, our algorithm may not always produce good clusters. There are better criteria for choosing the initial clusters; the k-means++ algorithm is a popular choice for this task.
- **Different algorithms:** Some variants of the k-means algorithm are more robust to constraints such as outliers. One such algorithm is k-medoids, which uses actual data points (medoids) as cluster centers and can work with distance measures other than the squared Euclidean distance, such as the L1 distance. There are also other clustering algorithms suited to specific applications, such as hierarchical clustering, density-based clustering, and fuzzy clustering.
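As mentioned in the feature-scaling tip above, here is a minimal NumPy-only standardization sketch (the standardize helper is illustrative; libraries such as scikit-learn offer equivalent scalers):

```python
import numpy as np

def standardize(X):
    # Rescale each feature (column) to zero mean and unit variance
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unchanged instead of dividing by zero
    return (X - mean) / std

X_scaled = standardize(X)  # X: raw numeric feature matrix
model = Kmeans(k=3)
model.fit(X_scaled)
```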
Conclusion
I hope all of you enjoyed this blog post. For more articles on Data Science, check out my other posts on Medium. Feel free to connect with me on LinkedIn. The code for this blog post is on my GitHub.