Initialization-Free Graph-Based Clustering

This paper proposes an original approach to clustering multi-component data sets that includes an estimation of the number of clusters. A minimal spanning tree is built with Prim's algorithm and, under the assumption that the vertices are approximately distributed according to a Poisson distribution, the number of clusters is estimated by thresholding the Prim trajectory. The corresponding cluster centroids are then computed to initialize the generalized Lloyd algorithm, also known as K-means, which circumvents its usual initialization problems. Results are derived for evaluating the false positive rate of the cluster detection algorithm, using approximations relevant in Euclidean spaces. The metrics used for measuring similarity between multi-dimensional data points are based on symmetrized divergences. Combined with the proposed method, these informational divergences lead to better results than other clustering methods on an astrophysical data processing problem. Applications of the method in the multi/hyperspectral imagery domain, to a satellite view of Paris and to an image of the planet Mars, are also presented. To demonstrate the usefulness of divergences in this setting, the method with an informational divergence as similarity measure is compared with the same method using classical metrics. In the astrophysics application, the method is also compared with spectral clustering algorithms.
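The core pipeline of the abstract — build a minimal spanning tree with Prim's algorithm, read off the Prim trajectory (the sequence of edge weights in the order vertices join the tree), and threshold it to estimate the number of clusters — can be sketched as follows. This is a simplified illustration, not the authors' implementation: the paper derives its detection threshold from the Poisson assumption, whereas `factor` below is a hypothetical fixed multiple of the mean edge weight, and `sym_kl` is a generic symmetrized Kullback–Leibler divergence standing in for the informational divergences discussed in the paper.

```python
import math

def prim_trajectory(points):
    """Prim's algorithm on a Euclidean point set.

    Returns (trajectory, edges): the edge weights in the order the
    edges are added to the tree, and the corresponding (parent, child)
    vertex pairs.
    """
    n = len(points)
    in_tree = [False] * n
    in_tree[0] = True
    best = [math.dist(points[0], p) for p in points]  # cheapest link to the tree
    parent = [0] * n
    trajectory, edges = [], []
    for _ in range(n - 1):
        # pick the non-tree vertex closest to the current tree
        j = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[j] = True
        trajectory.append(best[j])
        edges.append((parent[j], j))
        # relax the remaining vertices against the newly added one
        for i in range(n):
            if not in_tree[i]:
                d = math.dist(points[j], points[i])
                if d < best[i]:
                    best[i] = d
                    parent[i] = j
    return trajectory, edges

def estimate_k(trajectory, factor=2.5):
    """Estimate the number of clusters from the Prim trajectory.

    Each edge much longer than the typical edge is taken as a jump
    between clusters; k = number of jumps + 1.  The fixed `factor` is a
    stand-in for the paper's Poisson-derived threshold.
    """
    mean_w = sum(trajectory) / len(trajectory)
    return 1 + sum(w > factor * mean_w for w in trajectory)

def sym_kl(p, q):
    """Symmetrized Kullback-Leibler divergence between two positive,
    normalized spectra (a generic example of an informational divergence)."""
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))
```

On two well-separated groups of points, the trajectory shows one large jump and `estimate_k` returns 2; the detected components' centroids would then seed K-means, removing the dependence on random initialization.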
