One of the most basic building blocks in Data Mining is the clustering problem – given a set of untagged (hence, by the way, it is considered an unsupervised) observations, the goal is to group them in such a way that observations of the same group (a.k.a cluster ) are more similar to each other than to those in other groups.
When dealing with real-world data problems this obviously leads to the question of what makes two observations similar. Therefor, an inherent preliminary step in the clustering problem is to project the observations to a space (which is often multi-dimensional) of which each dimension (or feature ) is crucial to the definition of similarity between observations. For example, by trying to cluster real-estate properties, some reasonable features could be: the size of the property , the number of rooms , the proximity to city-center , etc.
K-Means
There are different methods aiming to solve the unsupervised clustering problem. In problems where the features are continuous, the K-Means algorithm is undoubtedly the most used one, mostly for its simplicity. In this post I’ll give a brief introduction to the theory behind it (which will be explained in better details in a later post), followed by enumerating some of its underlying assumptions (and hence its characteristics), so by the end of reading you should be able to decide when to use it and to search for a better alternative.
Model
As many other problems in Machine Learning, the K-Means clustering problem can be formulated as an optimization problem. given a set of observations , partition it into k subsets (clusters) with the corresponding centroids (cluster center points, i.e. the mean of points in a subset) so as to minimize a cost function of sum of inter-cluster variances :
K-Means is defined over two sets of target variables:
- Mapping of each observation to exactly one cluster
- The clusters centroids, denoted
which are mutually depend one on the other – on one hand, a cluster’s centroid is affected by the observations that belong to its cluster; on the other hand, the mapping of each observation to a cluster is naturally defined by its nearest centroid. Therefor, this problem is analytically intractable, and a two-step iterative method is used:
Initialization step – At first, k cluster centroids are chosen (for now assume this is done in a random fashion. A less naive methodology is brought in the end of the post).
- Map each observation to a cluster according to the closest centroid rule ( )
- According to the most updated observations-clusters mappings (from step 1), update each cluster’s centroid to be the mean of points that belong to its cluster.
- Repeat steps 1+2 until the cost function converges (which is guaranteed eventually, and empirically-speaking, is it doesn’t take a lot of iterations to converge).
Assumptions
In the Data Mining world, it is extremely important to be able to understand what are the cons and pros of applying each technique/algorithm to your data. In my opinion, it is the only way to guarantee that the results reflect the true performance of your application, and are not tailored to the training set alone (a.k.a “overfitting”). More specifically, K-Means is often being criticized for its underlying assumptions which are too naive for a large set of datasets/problems. So let’s revisit this model from a different perspective and enumerate its assumptions and the nature of the results:
- Clusters’ Shape – Since in the assignment step each observation is mapped into its nearest cluster’s centroid , in each of K-Means’ iterations the space is divided into what’s called a Voronoi Diagram . So, for example, applying K-Means on the following dataset
should result in the following clustering (the black dots are the centroids):
Note the lines that divide the space – they are the actual boundaries between the clusters , so when futures observations will be mapped to any one of the clusters according to the nearest centroid rule, this is how the space partition will look like. This is not necessarily a bad result, but neither is always a good one – the shapes of the clusters are not what you’d expect in some applications.
2. Clusters’ Size – let’s look at another example:
This time it looks like data is generated from two cluster which are different in their size . Let’s assume, again, that K-Means is properly initialized so that it captures the real centroids. Let me remind you, the clustering problem is all about partitioning the space into regions such that each region includes observations that are “similar” (close to each other in the feature space – in this example x and y axes) .
Try to think of where you should have drawn a line between those two clusters, so when a future observation arrives, you can map it to either one of them.
Now let’s find out where would K-Means draw this line:
Doesn’t look good, right? That’s because K-Means does not account for neither cluster sizes nor the density (but rather for the distance to the centroids). In the example above the smaller cluster have 50 dense observations and the bigger cluster have 250 sparser observations.
3. Dependent Features – due to the fact that naturally the distance that K-Means computes from each observation to the centroids is the Euclidean distance, it doesn’t capture dependencies in the feature space. So for example, if large x values mean large y values ( x and y are dependent), as in this example:
then K-Means fails to capture this, and at best results with this poor partition of the space:
Assuming the same dependency nature applies to all cluster equally, this actually can be solved by preprocessing the data using a Principal Component Analysis (PCA) , or modifying your K-Means implementation to compute Mahalanobis distance rather than Euclidean distance (both are mathematically equivalent). But sometimes even this assumption does not hold, as in the next section.
4. Within-Cluster Feature Dependencies – Consider the following dataset:
where features (x and y axes) seem to be strongly correlated, but in contrast to the previous section, the nature of features-correlation is dependent of (and varying across) the clusters. It seems that in the upper-left cluster, bigger x-values mean bigger y-values, but in the lower-right cluster bigger x-values means lower y-values.
Unfortunately, K-Means will fail to capture this kind of dependencies, and generally there is no robust-and-simple way of augmenting K-Means with the ability to overcome this rebellious characteristic that is inherent to such datasets, in contrast to a global feature dependency like in the last section, that is ultimately can be solved by PCA. If you wondered, this is how K-Means solution will look like:
Having said that, if this is your case, you might want to check out Independent Component Analysis . A bigger problem arises when data is not easy to visualize (as in multi-dimensional data). Computing the covariance/correlation matrix can help capture global dependencies between features, but will fail to capture cluster-dependent correlation like the one introduced here.
5. Convergence to Local Minimum – Finally, due to the fact that this algorithm is a special case of the Expectation-Maximization (EM) paradigm (which is out the scope of this post, but I’ll write on it more broadly in the next post), it is guaranteed to converge – but not to the optimal solution. Similarly to gradient-descent methods (a common ML optimizing method), EM algorithms are guaranteed to converge to local minimum. Since K-Means’ cost function is not convex, initializing the target variables (cluster-centroids / observation-cluster mappings) is extremely important – as it directly affects the quality of results.
Various heuristics (K-Means extensions) exist that aim to solve this problem. For more information please refer to K-Means++ and Scalable K-Means++ .
Liked that post?to get a notification when a similar post is published! Want to choose what will I write on next?!