Clustering with a new distance measure based on a dual-rooted tree

Abstract: This paper introduces a novel distance measure for clustering high-dimensional data, based on the hitting time of two minimal spanning trees (MSTs) grown sequentially from a pair of points by Prim’s algorithm. Used in conjunction with spectral clustering, the proposed measure yields a clustering algorithm that can separate neighboring non-convex clusters and account for both local and global geometric features of the data set. Remarkably, the new distance measure is a true metric even if Prim’s algorithm uses a non-metric dissimilarity measure to compute the edges of the MST. This metric property brings added flexibility to the proposed method. In particular, the method is applied to clustering non-Euclidean quantities, such as probability distributions or spectra, using the Kullback–Leibler divergence as a base measure. We reduce computational complexity by applying consensus clustering to a small ensemble of dual-rooted MSTs. We show that the resulting consensus spectral clustering with dual-rooted MSTs is competitive with other clustering methods in terms of both clustering performance and computational complexity. We illustrate the proposed clustering algorithm on public-domain benchmark data for which the ground truth is known and on real-world astrophysical data.
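
The abstract does not spell out the exact definition of the distance, so the following is only a minimal sketch of the general idea of growing two Prim trees from a pair of roots until they meet. The function name `dual_rooted_hitting`, the tie-breaking rule, and the two returned quantities (the number of edges added before collision and the largest edge weight added) are assumptions made for illustration, not the paper's definition.

```python
import math

def dual_rooted_hitting(points, root_a, root_b, dissim=math.dist):
    """Grow two Prim trees from root_a and root_b until they collide.

    Hypothetical helper: returns (steps, max_weight), where steps counts the
    edges added (including the one that joins the trees) and max_weight is the
    largest edge weight added. `dissim` may be any pairwise dissimilarity.
    """
    if root_a == root_b:
        return 0, 0.0
    n = len(points)
    owner = {root_a: 0, root_b: 1}  # vertex index -> id of the tree holding it
    steps, max_w = 0, 0.0
    while True:
        # Prim's rule applied to both trees at once: find the cheapest edge
        # leaving either tree (its far end is free or in the other tree).
        best = None
        for u in owner:
            for v in range(n):
                if owner.get(v) == owner[u]:
                    continue  # edge stays inside the same tree
                w = dissim(points[u], points[v])
                if best is None or w < best[0]:
                    best = (w, u, v)
        w, u, v = best
        steps += 1
        max_w = max(max_w, w)
        if v in owner:
            # The cheapest edge connects the two trees: they have met.
            return steps, max_w
        owner[v] = owner[u]

# Two well-separated groups on a line; roots are taken at the far ends.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
steps, max_w = dual_rooted_hitting(points, 0, 4)
```

In this toy example both trees first absorb their own group through short edges, and only then does the long edge across the gap join them. This is the intuition behind a distance that reflects cluster structure rather than straight-line separation. Passing a different `dissim` (for example a divergence between histograms) leaves the procedure unchanged.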
