Unsupervised clustering on multi-component datasets: Applications on images and astrophysics data

This paper proposes an original approach to clustering multi-component data sets that also estimates the number of clusters. A minimal spanning tree is built with Prim's algorithm and, under the assumption that the vertices are approximately distributed according to a Poisson distribution, the number of clusters is estimated by thresholding the Prim trajectory. The corresponding cluster centroids are then computed and used to initialize the generalized Lloyd algorithm, also known as K-means, which circumvents initialization problems. The metrics used to measure similarity between multi-dimensional data points are based on symmetrical divergences. Used together with the proposed method, these informational divergences lead to better results than some other clustering methods in the framework of astrophysical data processing. An application of the method to multi-spectral imagery, using a satellite view of Paris, is also presented.
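The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It makes several assumptions. The "Prim trajectory" is taken to be the sequence of edge weights in the order Prim's algorithm adds them. The threshold is passed in by hand, not derived from the Poisson assumption. The symmetrised Kullback-Leibler divergence stands in for the "symmetrical divergences". Centroids are updated with a plain arithmetic mean, which is not the exact minimizer for that divergence. All function names (`symmetric_kl`, `prim_trajectory`, `split_trajectory`, `lloyd`, `cluster`) are hypothetical.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrised KL divergence between two positive vectors (assumed metric)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()      # normalise to unit sum
    return float(np.sum((p - q) * np.log(p / q)))

def prim_trajectory(X, dist):
    """Run Prim's algorithm from point 0; return visit order and edge weights."""
    n = len(X)
    D = np.array([[dist(X[i], X[j]) for j in range(n)] for i in range(n)])
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()                   # cheapest known link to the tree
    best[0] = np.inf
    order, weights = [0], []
    for _ in range(n - 1):
        j = int(np.argmin(best))         # closest vertex not yet in the tree
        weights.append(best[j])
        order.append(j)
        in_tree[j] = True
        best = np.minimum(best, D[j])
        best[in_tree] = np.inf           # never re-select tree vertices
    return order, np.array(weights)

def split_trajectory(order, weights, threshold):
    """Start a new group whenever the trajectory exceeds the threshold."""
    groups = [[order[0]]]
    for node, w in zip(order[1:], weights):
        if w > threshold:
            groups.append([])
        groups[-1].append(node)
    return groups

def lloyd(X, centroids, dist, n_iter=20):
    """Lloyd iterations started from the given centroids (mean update)."""
    C = np.array(centroids, float)
    for _ in range(n_iter):
        labels = np.array([int(np.argmin([dist(x, c) for c in C])) for x in X])
        for k in range(len(C)):
            if np.any(labels == k):
                C[k] = X[labels == k].mean(axis=0)
    return labels, C

def cluster(X, dist, threshold, n_iter=20):
    """Estimate the clusters from the Prim trajectory, then refine with Lloyd."""
    X = np.asarray(X, float)
    order, weights = prim_trajectory(X, dist)
    groups = split_trajectory(order, weights, threshold)
    centroids = [X[g].mean(axis=0) for g in groups]
    return lloyd(X, centroids, dist, n_iter)
```

The number of estimated clusters is simply the number of centroids returned. In this sketch the result depends strongly on the hand-chosen `threshold`, and the pairwise distance matrix makes it suitable only for small data sets.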
