How to find the optimal K for K-means clustering

Donna Lee
Apr 19, 2021

Introduction

The K-means algorithm is one of the most popular methods for clustering unlabeled data into k clusters. It tries to make the intra-cluster distances as small as possible while keeping the inter-cluster distances as large as possible.

K-means clustering steps

  1. Specify k (arguably the most important part; we will come back to how to find this)
  2. The algorithm initializes the centroids by randomly selecting k data points as centroids, without replacement
  3. We assign each data point to the closest centroid
  4. Then we compute the sum of the squared distances between each data point and the centroid of its cluster
  5. We recompute each centroid as the average of the data points assigned to that cluster
  6. Repeat steps 3–5 until the cluster assignments stop changing (a minimal sketch of these steps follows this list)
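Here is that minimal sketch, using scikit-learn's KMeans on synthetic data; the synthetic blobs and the parameter values are placeholders, not the shopper data used later in this post.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 4 well-separated groups (placeholder data)
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Steps 2-6 all happen inside fit(): initialize k centroids, assign points,
# recompute the centroids, and repeat until the assignments stop changing
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=1)
labels = kmeans.fit_predict(X_demo)

print(kmeans.cluster_centers_)  # final centroid coordinates
print(kmeans.inertia_)          # within-cluster sum of squared distances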

Finding K

As you can imagine, the number k is a really important part of the algorithm.

If your data looks something like the scatter plot below, it is easy to see what k should be.

Oftentimes, it is not that easy. What if your data looked like this? It becomes harder to tell visually.

There are two methods, the elbow method and the silhouette method, that we can use to find the optimal number of clusters.

Elbow Method

In the shoppers’ data above, we plotted age against spending score for 200 shoppers. It is much harder to tell the clusters apart just by looking at the plot.

To find the optimal k:

  1. Create an array with the two features in question
X = df.iloc[:, [2,4]].values
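For context, here is a sketch of how X might be built end to end; the file name Mall_Customers.csv and the column positions (2 for Age, 4 for Spending Score) are assumptions based on the Kaggle dataset linked under Resources.

import pandas as pd

# Assumed file name from the Kaggle Mall Customers dataset linked under Resources
df = pd.read_csv('Mall_Customers.csv')

# Columns 2 and 4 are assumed to be Age and Spending Score (1-100)
X = df.iloc[:, [2, 4]].values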

2. Then we test different numbers of clusters to get the inertia (within-cluster sum of squares) for each value of k.

from sklearn.cluster import KMeans

# Inertia (within-cluster sum of squares) for k = 1 through 10
inert = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=1)
    kmeans.fit(X)
    inert.append(kmeans.inertia_)

3. After we have the list of inertia values for the different numbers of clusters, we can plot the elbow graph (a sketch of the plot follows).
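A sketch of that plot with matplotlib:

import matplotlib.pyplot as plt

plt.plot(range(1, 11), inert, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()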

4. We can find the optimal number of clusters by looking for the elbow of the graph, the point after which the inertia starts decreasing in a roughly linear fashion. Here we can see that the elbow sits at 4 clusters, so we would use k=4 in our algorithm.

Let’s take a look at how our clusters turned out.
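One way to visualize the result is a sketch along these lines, assuming k=4 from the elbow above and that X holds the Age and Spending Score features described earlier:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit with the k suggested by the elbow and color each point by its cluster
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=1)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', label='Centroids')
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.legend()
plt.show()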

Silhouette Method

Another popular method for determining the optimal number of clusters is the silhouette method.

This method calculates the silhouette coefficient/score, which measures the quality of the clustering. The coefficient lies in the range [-1, 1] and is calculated for each sample from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b). The silhouette coefficient for a sample is (b - a) / max(a, b) (more details: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).

A coefficient of 1 is the best. It means that the data point is very similar to the other points in its cluster and that the clusters are well separated and clearly distinguished. A coefficient of 0 means there is likely overlap between the clusters. A coefficient of -1 means the data point has been assigned to the wrong cluster and a different cluster would be a better fit.
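As a quick sanity check of the formula: with a mean intra-cluster distance a = 0.2 and a mean nearest-cluster distance b = 0.8 (made-up numbers), the coefficient is (0.8 - 0.2) / 0.8 = 0.75. Per-sample coefficients can also be pulled from scikit-learn with silhouette_samples, assuming X and the fitted labels from above:

from sklearn.metrics import silhouette_samples

# One coefficient per data point, rather than the mean over all points
sample_scores = silhouette_samples(X, labels, metric='euclidean')
print(sample_scores.min(), sample_scores.mean(), sample_scores.max())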

To find the optimal k, we want the value of k with the highest silhouette coefficient.

Steps:

  1. Find the mean silhouette score of all samples using silhouette_score from the sklearn library.
from sklearn.metrics import silhouette_score

# Mean silhouette score for k = 2 through kmax (the score needs at least 2 clusters)
sil = []
kmax = 10
for k in range(2, kmax + 1):
    kmeans = KMeans(n_clusters=k).fit(X)
    labels = kmeans.labels_
    sil.append(silhouette_score(X, labels, metric='euclidean'))

2. (Optional) Plot the scores against the number of clusters to find the maximum
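A sketch of that plot, mirroring the elbow plot above:

import matplotlib.pyplot as plt

plt.plot(range(2, kmax + 1), sil, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Mean silhouette score')
plt.title('Silhouette method')
plt.show()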

We can see here that the coefficient is highest at 4 clusters (the same k as the elbow method gave us).

Summary

These two methods are the most popular ways of determining the optimal k for k-means clustering. So which one do we use?

Both!

These two methods are not complete alternatives to each other but rather great complements. Think of it as part 1 (the elbow method) and part 1.5 (the silhouette method). Using the elbow method, you can determine the number of clusters; then, using the silhouette method, you can evaluate the clustering by looking at the coefficient.

Resources

https://towardsdatascience.com/are-you-solving-ml-clustering-problems-using-k-means-68fb4efa5469

https://medium.com/@cmukesh8688/silhouette-analysis-in-k-means-clustering-cefa9a7ad111

https://www.kaggle.com/vjchoudhary7/kmeans-clustering-in-customer-segmentation/data?select=Mall_Customers.csv
