Dirichlet Process Mixture Model for Document Clustering with Feature Partition

Finding the appropriate number of clusters into which documents should be partitioned is crucial in document clustering. In this paper, we propose a novel approach, DPMFP, that discovers the latent cluster structure based on the Dirichlet Process Mixture (DPM) model without requiring the number of clusters as input. Document features are automatically partitioned into two groups, discriminative words and nondiscriminative words, which contribute differently to document clustering. We investigate a variational inference algorithm that simultaneously infers the structure of the document collection and the partition of document words. Our experiments indicate that the proposed approach performs well on both a synthetic data set and real data sets. A comparison with state-of-the-art document clustering approaches shows that our approach is robust and effective for document clustering.
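
To illustrate why a DPM model does not need the number of clusters as input, the following minimal sketch draws mixture weights from a truncated stick-breaking construction of a Dirichlet process prior. It is a generic illustration, not the DPMFP inference algorithm itself; the function name, the concentration parameter `alpha`, the truncation level, and the 0.01 threshold are all assumptions chosen for the example.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha      -- concentration parameter (assumed value; smaller alpha
                  tends to put the mass on fewer clusters)
    truncation -- maximum number of components kept (an approximation;
                  the untruncated process has infinitely many)
    rng        -- a random.Random instance, for reproducibility
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any cluster
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remainder
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component takes the rest, so the sum is 1
    return weights


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)

# Count components with non-negligible weight (threshold is illustrative).
effective = sum(1 for w in weights if w > 0.01)
print(f"{effective} of {len(weights)} components carry more than 1% of the mass")
```

Only a handful of components typically receive appreciable weight, so the number of clusters actually used is determined by the draw and, in a full model, by the data, rather than fixed in advance.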
