Estimating the number of clusters in a data set via the gap statistic

We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K‐means or hierarchical), comparing the change in within‐cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
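The abstract gives no formulas, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes the commonly described formulation: Gap(k) is the average of log W_k over reference data sets minus log W_k for the observed data, where W_k is the within-cluster sum of squares; the reference distribution is taken to be uniform over the bounding box of the data; and the estimate is the smallest k with Gap(k) >= Gap(k+1) - s_(k+1). The function names (`kmeans_wk`, `gap_statistic`), the k-means++ style seeding, and all default parameters are illustrative choices.

```python
import math
import random


def _init_centers(points, k, rng):
    # k-means++ style seeding (an assumption of this sketch): each new centre
    # is drawn with probability proportional to squared distance from the
    # centres chosen so far.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2)[0])
    return centers


def kmeans_wk(points, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares W_k found over n_init Lloyd runs."""
    best = math.inf
    for _ in range(n_init):
        centers = _init_centers(points, k, rng)
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[j].append(p)
            # An empty cluster keeps its previous centre.
            new_centers = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                for j, cl in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wk = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
        best = min(best, wk)
    return best


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return (estimated k, list of Gap(k) for k = 1..k_max).

    Assumes continuous data with more distinct points than k_max,
    so that W_k > 0 and log W_k is defined.
    """
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_wk = math.log(kmeans_wk(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            # Reference data: uniform over the bounding box of the observations.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_wk(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_ref)
        gaps.append(mean_ref - log_wk)
        errs.append(sd * math.sqrt(1 + 1 / n_ref))
    # Smallest k such that Gap(k) >= Gap(k+1) - s_(k+1).
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


if __name__ == "__main__":
    demo_rng = random.Random(1)
    blobs = [(0, 0), (10, 0), (5, 10)]  # three well-separated synthetic groups
    data = [
        (demo_rng.gauss(cx, 0.5), demo_rng.gauss(cy, 0.5))
        for cx, cy in blobs
        for _ in range(30)
    ]
    k_hat, gap_values = gap_statistic(data, k_max=5)
    print("estimated number of clusters:", k_hat)
    print("gap values:", [round(g, 3) for g in gap_values])
```

Because the abstract says the technique works with the output of any clustering algorithm, `kmeans_wk` could be swapped for any routine that returns a within-cluster dispersion; K-means is used here only because it is short to write with the standard library.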
