An efficient K-means algorithm for massive data

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm remains one of the most popular clustering methods, in spite of its dependency on the initial settings and its high computational cost, especially in terms of distance computations. In this work, we propose an efficient approximation to the K-means problem intended for massive data. Our approach recursively partitions the entire dataset into a small number of subsets, each of which is characterized by its representative (center of mass) and its weight (cardinality). A weighted version of the K-means algorithm is then applied over this local representation, which can drastically reduce the number of distances computed. In addition to some theoretical properties, experimental results indicate that our method outperforms well-known approaches, such as K-means++ and minibatch K-means, in terms of the trade-off between the number of distance computations and the quality of the approximation.
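
The sketch below is not the authors' implementation; it is a minimal Python/NumPy illustration of the general idea stated in the abstract: summarize the dataset by a small set of weighted representatives (center of mass and cardinality per subset), then run a weighted Lloyd iteration over those representatives instead of over every point. The recursive binary splitting used here, and the helper names summarize and weighted_kmeans, are assumptions chosen for illustration only.

```python
import numpy as np

def summarize(X, max_leaf_size=256):
    """Recursively split X and return (representatives, weights).

    Hypothetical partitioning: repeated median splits along the widest
    dimension until each leaf holds at most max_leaf_size points.
    """
    reps, weights = [], []

    def split(idx):
        if len(idx) <= max_leaf_size:
            reps.append(X[idx].mean(axis=0))   # center of mass of the subset
            weights.append(len(idx))           # cardinality of the subset
            return
        block = X[idx]
        dim = np.argmax(block.max(axis=0) - block.min(axis=0))
        median = np.median(block[:, dim])
        left = idx[block[:, dim] <= median]
        right = idx[block[:, dim] > median]
        if len(left) == 0 or len(right) == 0:  # degenerate split, stop here
            reps.append(block.mean(axis=0))
            weights.append(len(idx))
            return
        split(left)
        split(right)

    split(np.arange(len(X)))
    return np.asarray(reps), np.asarray(weights, dtype=float)

def weighted_kmeans(reps, weights, k, n_iter=50, seed=0):
    """Lloyd's algorithm run on the weighted representatives only."""
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), size=k, replace=False)]
    for _ in range(n_iter):
        # Distances are computed to the representatives, not to all points.
        d = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(reps[mask], axis=0, weights=weights[mask])
    return centers

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(100_000, 5))
    reps, w = summarize(X)                 # far fewer points than X
    centers = weighted_kmeans(reps, w, k=10)
    print(centers.shape)                   # (10, 5)
```

The saving comes from the assignment step: distances are evaluated against a few hundred weighted representatives rather than against all points, while the weighted mean keeps each center anchored by the cardinality of the subsets assigned to it.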
