A comparison of clustering quality indices using outliers and noise

Quality indices in clustering are used not only to assess the quality of partitions but also to determine the number of clusters in the final result. When these indices are evaluated in a case study, real data conditions and differences between clustering algorithms are seldom taken into account. Here, some of the standard indices used in the literature are compared on more realistic data sets that include outliers or noisy dimensions, which better reflects real problem-solving conditions. In addition, three different clustering methods are used in an attempt to identify differences in behaviour. As an additional check, the performance of each quality index-clustering algorithm tandem is also compared to random grouping. The indices are ranked, and index-based conclusions are drawn for all the scenarios.
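The random-grouping check can be illustrated with a minimal sketch. It is not the study's procedure: the silhouette width stands in for whichever indices are actually compared, and the data generator, the outlier count, and the names `silhouette`, `s_structured` and `s_random` are assumptions made for illustration only. The sketch scores a partition that follows the generating clusters and a random partition of the same points, on data contaminated with a few uniformly scattered outliers.

```python
import math
import random


def silhouette(points, labels):
    """Mean silhouette width of a hard partition (assumes at least two clusters)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            # Singleton cluster: silhouette is 0 by convention.
            scores.append(0.0)
            continue
        a = mean_dist[own]                                      # within-cluster cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
centers = [(0.0, 0.0), (10.0, 10.0)]        # hypothetical cluster centres

points, structured = [], []
for c, (cx, cy) in enumerate(centers):
    for _ in range(30):
        points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        structured.append(c)

# Contaminate the data with a few outliers drawn uniformly from a wide box;
# each outlier is assigned to the nearest generating centre.
for _ in range(6):
    p = (rng.uniform(-10.0, 20.0), rng.uniform(-10.0, 20.0))
    points.append(p)
    structured.append(min(range(len(centers)), key=lambda c: math.dist(p, centers[c])))

# Random grouping baseline: same points, labels drawn at random.
random_labels = [rng.randrange(len(centers)) for _ in points]

s_structured = silhouette(points, structured)
s_random = silhouette(points, random_labels)
print(f"structured partition: {s_structured:.3f}")
print(f"random grouping:      {s_random:.3f}")
```

An index that is still informative under contamination should score the structured partition clearly above the random one. Repeating the comparison with more outliers or with added noisy dimensions would show how quickly that margin shrinks.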
