Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Several large-scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional. Such data is often L2-normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multivariate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation-Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the two variants of EM. Our approach provides a theoretical basis for the cosine similarity measure widely used in the information retrieval community, and obtains the spherical k-means algorithm (k-means with cosine similarity) as a special case of both variants. Empirical results on clustering high-dimensional text and gene-expression data with mixtures of vMF distributions show that the ability to estimate a concentration parameter for each vMF component, which is absent from existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.
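The concentration parameter is the crux of the estimation problem: the maximum-likelihood estimate of kappa for a single vMF component satisfies A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa) = r_bar, where r_bar is the mean resultant length of the data and I_v is the modified Bessel function of the first kind, and this Bessel-function ratio has no closed-form inverse. The Python sketch below is a minimal illustration of one component's M-step (the helper names fit_vmf and log_vmf_density are ours, not the paper's), using the simple approximation kappa ~ (r_bar * d - r_bar^3) / (1 - r_bar^2) that the paper advocates to sidestep the inversion:

import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_v

def fit_vmf(X):
    """Fit a single vMF distribution to the unit-norm rows of X (shape n x d).

    The exact MLE of kappa solves A_d(kappa) = I_{d/2}(kappa)/I_{d/2-1}(kappa)
    = r_bar, which has no closed form; we use the paper's approximation
    instead. Assumes r_bar < 1, i.e. the data are not perfectly aligned.
    """
    n, d = X.shape
    s = X.sum(axis=0)                  # resultant vector of the sample
    r = np.linalg.norm(s)
    mu = s / r                         # MLE of the mean direction
    r_bar = r / n                      # mean resultant length, in [0, 1)
    kappa = (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)
    return mu, kappa

def log_vmf_density(X, mu, kappa):
    """Log-density log f(x; mu, kappa) = log c_d(kappa) + kappa * mu^T x,
    where c_d(kappa) = kappa^{d/2-1} / ((2 pi)^{d/2} I_{d/2-1}(kappa)).

    The scaled Bessel function ive keeps the normalizer stable for large
    kappa, via log I_v(kappa) = log ive(v, kappa) + kappa.
    """
    d = X.shape[1]
    v = d / 2.0 - 1.0
    log_norm = (v * np.log(kappa)
                - (d / 2.0) * np.log(2.0 * np.pi)
                - (np.log(ive(v, kappa)) + kappa))
    return log_norm + kappa * (X @ mu)

With per-component estimates of this form in the M-step, the E-step responsibilities follow from log_vmf_density plus the log mixing weights; constraining all components to share a single kappa and hardening the assignments recovers spherical k-means.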
