K-means++

In data mining, k-means++[1][2] is an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the NP-hard k-means problem—a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm. It is similar to the first of three seeding methods proposed, in independent work, in 2006[3] by Rafail Ostrovsky, Yuval Rabani, Leonard Schulman and Chaitanya Swamy. (The distribution of the first seed is different.)

Background

The k-means problem is to find cluster centers that minimize the intra-class variance, i.e. the sum of squared distances from each data point being clustered to its cluster center (the center that is closest to it). Although finding an exact solution to the k-means problem for arbitrary input is NP-hard,[4] the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.

However, the k-means algorithm has at least two major theoretic shortcomings:

  • First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.[5]
  • Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.

The k-means++ algorithm addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.

Example of a suboptimal clustering

Bad clustering of a rectangle
This is a bad clustering where the points A and D are in the red cluster of centroid E and the points B and C are in the blue cluster of centroid F, as the intra-cluster distance in not minimal

To illustrate the potential of the k-means algorithm to perform arbitrarily poorly with respect to the objective function of minimizing the sum of squared distances of cluster points to the centroid of their assigned clusters, consider the example of four points in that form an axis-aligned rectangle whose width is greater than its height.

Optimal clustering for the problem.

If and the two initial cluster centers lie at the midpoints of the top and bottom line segments of the rectangle formed by the four data points, the k-means algorithm converges immediately, without moving these cluster centers. Consequently, the two bottom data points are clustered together and the two data points forming the top of the rectangle are clustered together—a suboptimal clustering because the width of the rectangle is greater than its height.

Consider now extending the rectangle in a horizontal direction to any desired width. The standard k-means algorithm will continue to cluster the points suboptimally, and by increasing the horizontal distance between the two data points in each cluster, we can make the algorithm perform arbitrarily poorly with respect to the k-means objective function.

Improved initialization algorithm

The intuition behind this approach is that spreading out the k initial cluster centers is a good thing: the first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.

The exact algorithm is as follows:

  1. Choose one center uniformly at random among the data points.
  2. For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
  4. Repeat Steps 2 and 3 until k centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.

This seeding method yields considerable improvement in the final error of k-means. Although the initial selection in the algorithm takes extra time, the k-means part itself converges very quickly after this seeding and thus the algorithm actually lowers the computation time. The authors tested their method with real and synthetic datasets and obtained typically 2-fold improvements in speed, and for certain datasets, close to 1000-fold improvements in error. In these simulations the new method almost always performed at least as well as vanilla k-means in both speed and error.

Additionally, the authors calculate an approximation ratio for their algorithm. The k-means++ algorithm guarantees an approximation ratio O(log k) in expectation (over the randomness of the algorithm), where is the number of clusters used. This is in contrast to vanilla k-means, which can generate clusterings arbitrarily worse than the optimum.[6] A generalization of the performance of k-means++ with respect to any arbitrary distance is provided in .[7]

Applications

The k-means++ approach has been applied since its initial proposal. In a review by Shindler,[8] which includes many types of clustering algorithms, the method is said to successfully overcome some of the problems associated with other ways of defining initial cluster-centres for k-means clustering. Lee et al.[9] report an application of k-means++ to create geographical cluster of photographs based on the latitude and longitude information attached to the photos. An application to financial diversification is reported by Howard and Johansen.[10] Other support for the method and ongoing discussion is also available online.[11] Since the k-means++ initialization needs k passes over the data, it does not scale very well to large data sets. Bahman Bahmani et al. have proposed a scalable variant of k-means++ called k-means|| which provides the same theoretical guarantees and yet is highly scalable.[12]

Software

  • Accord.NET contains C# implementations for k-means, k-means++ and k-modes.
  • ALGLIB contains parallelized C++ and C# implementations for k-means and k-means++.
  • Apache Commons Math contains k-means++
  • ELKI data-mining framework contains multiple k-means variations, including k-means++ for seeding.
  • MATLAB has a K-Means implementation that uses k-means++ as default for seeding.
  • OpenCV contains C++ and Python K-means implementation (with optional k-means seed initialization).
  • Orange includes k-means++ UI widget and API support
  • R includes k-means, and the "flexclust" package can do k-means++
  • scikit-learn has a K-Means implementation that uses k-means++ by default.
  • Weka contains k-means (with optional k-means++) and x-means clustering.

References

  1. ^ Arthur, D.; Vassilvitskii, S. (2007). "k-means++: the advantages of careful seeding" (PDF). Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics Philadelphia, PA, USA. pp. 1027–1035.
  2. ^ http://theory.stanford.edu/~sergei/slides/BATS-Means.pdf Slides for presentation of method by Arthur, D. and Vassilvitskii, S.
  3. ^ Ostrovsky, R.; Rabani, Y.; Schulman, L. J.; Swamy, C. (2006). "The Effectiveness of Lloyd-Type Methods for the k-Means Problem". Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). IEEE. pp. 165–174.
  4. ^ Drineas, P.; Frieze, A.; Kannan, R.; Vempala, S.; Vinay, V. (2004). "Clustering Large Graphs via the Singular Value Decomposition". Machine Learning. 56 (1–3): 9–33. doi:10.1023/B:MACH.0000033113.59016.96.
  5. ^ Arthur, D.; Vassilvitskii, S. (2006). "How slow is the k-means method?". Proceedings of the twenty-second annual symposium on Computational geometry. ACM New York, NY, USA. pp. 144–153.
  6. ^ Kanungo, T.; Mount, D.; Netanyahu, N.; Piatko, C.; Silverman, R.; Wu, A. (2004), "A Local Search Approximation Algorithm for k-Means Clustering", Computational Geometry: Theory and Applications, 28 (2–3): 89–112, doi:10.1016/j.comgeo.2004.03.003.
  7. ^ Nielsen, Frank; Nock, Richard (2013), "Total Jensen divergences: Definition, properties and clustering", 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2016–2020, arXiv:1309.7109, Bibcode:2013arXiv1309.7109N, doi:10.1109/ICASSP.2015.7178324, ISBN 978-1-4673-6997-8, S2CID 463728.
  8. ^ https://web.archive.org/web/20110927100642/http://www.cs.ucla.edu/~shindler/shindler-kMedian-survey.pdf Approximation Algorithms for the Metric k-Median Problem
  9. ^ http://sir-lab.usc.edu/publications/2008-ICWSM2LEES.pdf Archived 2016-03-03 at the Wayback Machine Discovering Relationships among Tags and Geotags, 2007
  10. ^ http://www.cse.ohio-state.edu/~johansek/clustering.pdf[permanent dead link] Clustering Techniques for Financial Diversification, March 2009
  11. ^ Article title[usurped] Lingpipe Blog
  12. ^ B. Bahmani, B. Moseley, A. Vattani, R. Kumar, S. Vassilvitskii "Scalable K-means++" 2012 Proceedings of the VLDB Endowment.