If you’re standing close to others at a party, it’s likely you have something in common. This is the idea behind using k-means clustering to split data points into groups. Whether the groups formed via human agency or some other force, this algorithm will find them.
From detonations to dial tones: American physicist Stuart Lloyd, an alumnus of both Bell Labs’ iconic innovation factory and the Manhattan Project that invented the atomic bomb, first proposed k-means clustering in 1957 to distribute information within digital signals. He didn’t publish it until 1982. Meanwhile, American statistician Edward Forgy described a similar method in 1965, leading to its alternative name, the Lloyd-Forgy algorithm.
Finding the center: Consider breaking up the party into like-minded working groups. Given the positions of attendees in the room and the desired number of groups, k-means clustering can divide the attendees into groups of roughly equal size.
- During training, the algorithm initially designates k cluster center points, or centroids, by randomly choosing k people. (The value of k must be chosen manually, and finding an optimal value is not always trivial.) Then it grows k clusters by assigning each person to the closest centroid.
- For each cluster, it computes the mean position of all people assigned to the group and designates the mean position as the new centroid. The new centroids may not be occupied by a person, but so what? People tend to gather around the chocolate fondue.
- Having calculated new centroids, the algorithm reassigns individuals to the centroid closest to them. Then it computes new centroids, adjusts clusters, and so on, until the centroids (and the groups around them) no longer shift. (A minimal code sketch of this loop appears after this list.)
- From there, assigning newcomers to the right cluster is easy. Let them take their place in the room and look for the nearest centroid. (Be forewarned: Because the algorithm initially selects centroids at random, you may not end up in the same group as that cute data-centric AI specialist you were chatting with. K-means clustering does a good job, but it’s not guaranteed to find the best solution.)
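For readers who want the loop spelled out, here's a minimal sketch in plain NumPy. The function names, convergence tolerance, and random seed are our own illustrative choices, not part of any library.

```python
import numpy as np

def k_means(points, k, n_iters=100, tol=1e-6, seed=0):
    """Minimal Lloyd's algorithm. `points` is an (n, d) array; `k` is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Start by picking k random attendees as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        # (keeping the old position if a cluster happens to be empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        # Stop once the centroids (and the groups around them) no longer move.
        if shift < tol:
            break
    return centroids, labels

def assign_newcomers(newcomers, centroids):
    """Send each new arrival to the cluster of its nearest centroid."""
    distances = np.linalg.norm(newcomers[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)
```

Calling `k_means(positions, k=3)` on an array of attendee positions returns the final centroids and each person's group; `assign_newcomers` handles latecomers exactly as the last step above describes.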
Different distances: The distance between clustered objects doesn’t need to be spatial. Any measure between two vectors will do. For instance, rather than grouping partygoers according to physical proximity, k-means clustering can divide them by their outfits, occupations, or other attributes. Online shops use it to partition customers based on their preferences or behavior, and astronomers to group stars of the same type.
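As a rough illustration of clustering on non-spatial attributes, the snippet below groups shoppers by invented behavioral features using scikit-learn; the feature names and numbers are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented example: each row describes one shopper, not a position in a room.
# Columns: orders per month, average basket size ($), days since last visit.
customers = np.array([
    [12, 35.0, 2],
    [1, 80.0, 90],
    [10, 30.0, 5],
    [2, 75.0, 60],
    [11, 40.0, 3],
])

# Scale the features so no single attribute dominates the distance calculation.
features = StandardScaler().fit_transform(customers)

# Group shoppers into two segments; multiple random restarts (n_init) help
# avoid an unlucky initial choice of centroids.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(segments)  # e.g., [0 1 0 1 0]
```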
Power to the data points: The idea has spawned a few notable variations:
- K-medoids uses actual data points as cluster centers, or medoids, rather than mean positions. Each medoid is the point that minimizes the total distance to all other points in its cluster. Because every center is a real data point, this variation is easier to interpret (see the sketch after this list).
- Fuzzy C-Means Clustering lets data points participate in multiple clusters to varying degrees. It replaces hard cluster assignments with degrees of membership that depend on each point's distance from the centroids (also sketched below).
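To make both variations concrete, here's a small NumPy sketch. The helper names and the fuzziness exponent m = 2 are our own illustrative choices; the membership rule follows the standard fuzzy c-means formula.

```python
import numpy as np

def medoid(cluster_points):
    """K-medoids-style center: the member with the smallest total distance to its clustermates."""
    pairwise = np.linalg.norm(cluster_points[:, None, :] - cluster_points[None, :, :], axis=2)
    return cluster_points[pairwise.sum(axis=1).argmin()]

def fuzzy_memberships(points, centroids, m=2.0):
    """Fuzzy c-means memberships: an (n_points, n_clusters) matrix whose rows sum to 1."""
    # Distance from every point to every centroid.
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    dist = np.clip(dist, 1e-12, None)  # avoid dividing by zero when a point sits on a centroid
    # Standard update: u[i, j] = 1 / sum_k (d[i, j] / d[i, k]) ** (2 / (m - 1)).
    ratios = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=2)
```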
Revelry in n dimensions: Nonetheless, the algorithm in its original form remains widely useful, especially because, as an unsupervised algorithm, it doesn't require gathering potentially expensive labeled data. It also keeps getting faster to use. For instance, machine learning libraries including scikit-learn benefit from the 2002 addition of kd-trees, which partition data so that each point's nearest centroid can be found very quickly. By the way, if you throw any high-dimensional parties, we'd love to be on the guest list.
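As one hedged illustration of that kind of speedup, the snippet below uses scikit-learn's KDTree to batch-assign points to their nearest centroids; scikit-learn's KMeans handles this bookkeeping internally with its own optimizations, so treat this purely as a sketch of the idea.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.normal(size=(100_000, 3))   # 100,000 partygoers in a 3-D room
centroids = rng.normal(size=(10, 3))     # 10 current cluster centers

# Build a kd-tree over the centroids, then find each point's nearest one
# in a single batched query instead of comparing against every centroid.
tree = KDTree(centroids)
_, nearest = tree.query(points, k=1)
labels = nearest.ravel()                 # cluster assignment for every point
```

The payoff grows with the number of centroids; with only a handful of clusters, a brute-force comparison is already fast.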