Fast automatic estimation of the number of clusters from the minimum inter-center distance for k-means clustering

Abstract: Center-based clustering methods such as k-Means aim to identify closely packed clusters of data points by finding the center of each cluster. However, k-Means requires the user to supply the number of clusters rather than estimating it during the run. Accurate automatic estimation of the natural number of clusters in a data set is therefore important for making a clustering method truly unsupervised. For k-Means, the minimum pairwise distance between cluster centers decreases as the user-defined number of clusters increases. In this paper, we observe that the last significant reduction occurs just as the user-defined number surpasses the natural number of clusters. Based on this insight, we propose two techniques, the Last Leap (LL) and the Last Major Leap (LML), to estimate the number of clusters for k-Means. Across a number of challenging situations, we show that LL accurately identifies the number of well-separated clusters, whereas LML identifies the number of equal-sized clusters. Any disparity between the values of LL and LML can thus inform the user about the underlying cluster structure of the data set. The proposed techniques are independent of the size of the data set, making them especially suitable for large data sets. Experiments show that LL and LML perform competitively with the best cluster number estimation techniques while imposing a drastically lower computational burden.
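
The observation behind both techniques, that the minimum inter-center distance drops sharply once the requested number of clusters exceeds the natural number, can be illustrated with a small sketch. The abstract does not give the exact LL or LML criteria, so the selection rule below (pick the k followed by the largest relative drop) is only an assumed stand-in, not the authors' definition. The function names, the deterministic farthest-first seeding, and the toy data are likewise hypothetical choices made for this illustration.

```python
import itertools
import math


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch, not from the paper)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest chosen center is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans_centers(points, k, max_iter=100):
    """Plain Lloyd iterations; returns the final list of k centers."""
    centers = farthest_first_init(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def min_center_distance(centers):
    """Minimum pairwise distance between cluster centers."""
    return min(math.dist(a, b) for a, b in itertools.combinations(centers, 2))


def estimate_k_by_largest_drop(points, k_max):
    """Assumed stand-in for LL: the k followed by the largest relative drop in d_k."""
    d = {k: min_center_distance(kmeans_centers(points, k)) for k in range(2, k_max + 1)}
    drops = {k: (d[k] - d[k + 1]) / d[k] for k in range(2, k_max)}
    return max(drops, key=drops.get), d


def make_demo_points():
    """Three well-separated 3x3 grids of points (toy data for illustration only)."""
    offsets = (-0.5, 0.0, 0.5)
    return [
        (cx + dx, cy + dy)
        for cx, cy in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
        for dx in offsets
        for dy in offsets
    ]


if __name__ == "__main__":
    k_hat, distances = estimate_k_by_largest_drop(make_demo_points(), k_max=6)
    print("estimated k:", k_hat)
    for k, dist in distances.items():
        print(k, round(dist, 3))
```

Note that the selection step only inspects the k centers produced at each candidate k, so its cost does not grow with the number of data points; the k-Means runs themselves still do.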
