An efficient K-means clustering algorithm for massive data

The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. In this sense, cluster analysis algorithms are a key element of exploratory data analysis, due to their ease of implementation and relatively low computational cost. Among these algorithms, the K-means algorithm stands out as the most popular approach, despite its high dependency on the initial conditions and the fact that it might not scale well on massive datasets. In this article, we propose a recursive and parallel approximation to the K-means algorithm that scales well with both the number of instances and the dimensionality of the problem, without affecting the quality of the approximation. To achieve this, instead of analyzing the entire dataset, we work on small weighted sets of points that mostly aim to extract information from those regions where it is harder to determine the correct cluster assignment of the original instances. In addition to several theoretical properties, which motivate the reasoning behind the algorithm, experimental results indicate that our method outperforms the state of the art in terms of the trade-off between the number of distance computations and the quality of the solution obtained.
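The idea of clustering a small weighted set of points instead of the full dataset can be illustrated with a minimal Python sketch. This is not the algorithm proposed in the article: it assumes a fixed uniform grid as the partition (the article instead refines the partition where cluster assignment is hard), and the names `grid_representatives` and `weighted_lloyd` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def grid_representatives(X, bins=4):
    """Summarise X by the centre of mass and size of each occupied grid cell.

    Assumption: a fixed uniform grid stands in for the article's partition.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    cells = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()
    reps = np.zeros((len(counts), X.shape[1]))
    np.add.at(reps, inverse, X)   # sum the points falling in each cell
    reps /= counts[:, None]       # centre of mass of each cell
    return reps, counts.astype(float)

def weighted_lloyd(points, weights, centroids, n_iter=20):
    """Lloyd-style iterations in which each point counts `weights[i]` times."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(len(centroids)):
            mask = labels == k
            if mask.any():  # keep the old centroid if its cluster is empty
                centroids[k] = np.average(
                    points[mask], axis=0, weights=weights[mask]
                )
    return centroids

# Toy data: two well-separated Gaussian blobs of 500 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, (500, 2)),
    rng.normal(10.0, 0.5, (500, 2)),
])

reps, w = grid_representatives(X, bins=4)
# Naive initialisation: the leftmost and rightmost representatives.
init = reps[[np.argmin(reps[:, 0]), np.argmax(reps[:, 0])]]
centroids = weighted_lloyd(reps, w, init)
```

Here the distance computations in each iteration involve only the occupied grid cells (at most 16 in this toy setting) rather than all 1,000 instances, which is the kind of saving the abstract refers to. How coarse the partition can be without degrading the solution is exactly what a fixed grid does not address.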
