Estimating the number of clusters in a data set via the gap statistic

We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K‐means or hierarchical), comparing the change in within‐cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal, and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
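The comparison described above can be illustrated with a minimal Python sketch. The abstract does not give the computational details, so the specifics here are assumptions taken from the commonly cited formulation of the gap statistic: dispersion is measured as the pooled within-cluster sum of squares W_k, the reference null distribution is uniform over the bounding box of the data, Gap(k) is the mean of log W_k over reference sets minus log W_k for the data, and the estimate is the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}. The helper names (`kmeans_dispersion`, `gap_statistic`) and the plain K-means routine are hypothetical and chosen only for illustration.

```python
import math
import random


def kmeans_dispersion(points, k, rng, n_init=3, n_iter=50):
    """Smallest within-cluster sum of squares found by a basic K-means (illustrative)."""
    best = float("inf")
    for _ in range(n_init):
        centers = rng.sample(points, k)  # random restart: k data points as centers
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[j].append(p)
            # Recompute each center as its cluster mean; keep the old center if empty.
            new_centers = [
                tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[j]
                for j, cl in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        w = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
        best = min(best, w)
    return best


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return (estimated number of clusters, list of Gap(k) for k = 1..k_max)."""
    rng = random.Random(seed)
    lows = [min(x) for x in zip(*points)]
    highs = [max(x) for x in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_dispersion(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            # Assumed reference null: uniform over the bounding box of the data.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_dispersion(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_ref)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_ref))  # allowance for simulation error
    # Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


# Toy data (made up for this sketch): two well-separated Gaussian blobs in the plane.
data_rng = random.Random(1)
data = [(data_rng.gauss(cx, 0.5), data_rng.gauss(0.0, 0.5))
        for cx in (0.0, 10.0) for _ in range(40)]
k_hat, gaps = gap_statistic(data, k_max=4)
print("estimated number of clusters:", k_hat)
```

Any clustering routine that returns a within-cluster dispersion could replace `kmeans_dispersion`, since the method only needs the clustering output. The uniform bounding-box reference is one simple choice of null distribution and may not be the most appropriate one for a given data set.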
