Multi-objective selection for collecting cluster alternatives

Grouping objects into categories is a basic means of cognition. In machine learning and statistics, this task is addressed by cluster analysis. Yet how to assess the reliability and quality of clusterings remains controversial. In particular, it is hard to determine the optimal number of clusters inherent in the underlying data. Running different cluster algorithms and cluster validation methods usually yields different optimal clusterings. In fact, several clusterings with different numbers of clusters are plausible in many situations, as different methods are specialized for different structural properties. To account for the possibility of multiple plausible clusterings, we employ a multi-objective approach for collecting cluster alternatives (MOCCA) from a combination of cluster algorithms and validation measures. In an application to artificial data as well as microarray data sets, we demonstrate that exploring a Pareto set of optimal partitions rather than a single solution can identify alternative solutions that are overlooked by conventional clustering strategies. Competing solutions are ranked according to an impartial criterion, while the final judgement is left to the investigator.
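As a rough illustration of what exploring a Pareto set of partitions means, the following minimal sketch filters candidate clusterings down to the non-dominated ones. It is not the MOCCA implementation: the candidate labels, the two validation scores per candidate, and the convention that higher scores are better are all assumptions made for this example, and the ranking criterion mentioned in the abstract is not reproduced here.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, int]      # (algorithm name, number of clusters), hypothetical labels
Scores = Tuple[float, ...]       # one value per validation measure, higher is better (assumption)


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if a is at least as good as b on every objective
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(candidates: Dict[Candidate, Scores]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]


# Made-up validation scores, for illustration only.
candidates: Dict[Candidate, Scores] = {
    ("kmeans", 2): (0.61, 0.90),
    ("kmeans", 3): (0.70, 0.85),
    ("kmeans", 4): (0.55, 0.80),
    ("hierarchical", 3): (0.66, 0.88),
    ("hierarchical", 5): (0.40, 0.95),
}

front = pareto_set(candidates)
for algorithm, k in front:
    print(algorithm, k, candidates[(algorithm, k)])
```

In this toy input, only the k-means solution with four clusters is dropped, because the two-cluster k-means solution scores higher on both measures. The remaining four candidates trade one measure against the other, which is the situation in which several clusterings with different numbers of clusters stay plausible.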
