A Framework for Hierarchical Ensemble Clustering

Ensemble clustering, an important extension of the clustering problem, combines different (input) clusterings of a given dataset into a final (consensus) clustering that fits the data better, in some sense, than the input clusterings. Many ensemble clustering approaches have been developed over the past few years. However, most of them are designed for partitional clustering methods, and few research efforts have addressed ensembles of hierarchical clusterings. In this article, we propose a hierarchical ensemble clustering framework that can naturally combine both partitional and hierarchical clustering results. In addition, we develop a novel method that learns an ultra-metric distance from the aggregated distance matrices and uses it to generate a final hierarchical clustering with enhanced cluster separation. We study three important problems: dendrogram description, dendrogram combination, and dendrogram selection. We develop two approaches for dendrogram selection based on tree distances, and we investigate various dendrogram distances for representing dendrograms. We provide a systematic empirical study of the hierarchical ensemble clustering problem. Experimental results demonstrate the effectiveness of the proposed approaches.
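The pipeline outlined in the abstract (describe each input clustering as a distance matrix, aggregate the matrices, then derive an ultra-metric and a final dendrogram) can be illustrated with a minimal sketch. This is not the article's learning method: it assumes cophenetic distances as the dendrogram description, a plain mean as the aggregation rule, and average linkage as a stand-in for the ultra-metric fitting step. All function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform


def dendrogram_to_distance(Z):
    """Describe a dendrogram by its cophenetic distance matrix, scaled to [0, 1]."""
    D = squareform(cophenet(Z))
    return D / D.max()


def partition_to_distance(labels):
    """Describe a flat partition as a 0/1 matrix: 0 within a cluster, 1 across clusters."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)


def aggregate(distance_matrices):
    """Assumed aggregation rule: element-wise mean of the input descriptions."""
    return np.mean(distance_matrices, axis=0)


def fit_ultrametric(D, method="average"):
    """Stand-in for ultra-metric learning: cluster the aggregated matrix and
    return the resulting dendrogram with its cophenetic (ultra-metric) distances."""
    Z = linkage(squareform(D, checks=False), method=method)
    return Z, squareform(cophenet(Z))


def is_ultrametric(U, tol=1e-9):
    """Check d(i, k) <= max(d(i, j), d(j, k)) for every triple (i, j, k)."""
    bound = np.maximum(U[:, :, None], U[None, :, :])
    return bool(np.all(U[:, None, :] <= bound + tol))


# Toy data: two well-separated groups of six points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (6, 2)), rng.normal(3.0, 0.3, (6, 2))])

# Three hierarchical inputs plus one partitional input.
inputs = [dendrogram_to_distance(linkage(X, method=m))
          for m in ("single", "complete", "ward")]
inputs.append(partition_to_distance([0] * 6 + [1] * 6))

D = aggregate(inputs)
Z, U = fit_ultrametric(D)
```

Here `Z` is the consensus dendrogram in SciPy's linkage format and `U` is the ultra-metric it induces. The aggregated matrix `D` is generally not ultra-metric itself, which is why a fitting step is needed.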
