Integrating Clustering and Multi-Document Summarization by Bi-Mixture Probabilistic Latent Semantic Analysis (PLSA) with Sentence Bases

Probabilistic Latent Semantic Analysis (PLSA) is widely used in document analysis. However, as currently formulated, PLSA requires the number of latent word classes to equal the number of latent document classes. In this paper, we propose Bi-mixture PLSA, a new formulation of PLSA that allows the number of latent word classes to differ from the number of latent document classes. We further extend Bi-mixture PLSA to incorporate sentence information, and propose Bi-mixture PLSA with sentence bases (Bi-PLSAS), which clusters and summarizes documents simultaneously by exploiting the mutual influence between the document clustering and summarization procedures. Experiments on real-world datasets demonstrate the effectiveness of the proposed methods.
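
As an illustrative sketch only, the restriction on class counts can be seen in the standard symmetric PLSA factorization, where a single latent variable z with K values generates both words w and documents d. One way to decouple the two, assumed here for illustration and not necessarily the exact parameterization used in the paper, is to give words and documents separate latent variables z (K values) and y (L values) linked by a joint distribution. The symbols y, K, and L are illustrative notation.

```latex
% Standard symmetric PLSA: one latent class z shared by words and documents
P(w, d) = \sum_{k=1}^{K} P(z_k)\, P(w \mid z_k)\, P(d \mid z_k)

% Assumed bi-mixture form: K word classes, L document classes, K \neq L allowed
P(w, d) = \sum_{k=1}^{K} \sum_{l=1}^{L} P(w \mid z_k)\, P(z_k, y_l)\, P(d \mid y_l)
```

Under this assumed form, setting K = L and making P(z_k, y_l) diagonal recovers the standard model, which is why the single-variable formulation ties the two class counts together.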
