Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization

Multi-document summarization aims to produce a compressed summary that retains the main characteristics of the original set of documents. Many approaches use statistical and machine learning techniques to extract sentences from the documents. In this paper, we propose a new multi-document summarization framework based on sentence-level semantic analysis and symmetric non-negative matrix factorization. We first compute sentence-sentence similarities using semantic analysis and construct the similarity matrix. Symmetric matrix factorization, which has been shown to be equivalent to normalized spectral clustering, is then used to group the sentences into clusters. Finally, the most informative sentences are selected from each cluster to form the summary. Experimental results on the DUC2005 and DUC2006 data sets show that the proposed framework outperforms the existing summarization systems we implemented for comparison. We also study the factors that contribute to this performance.
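
The three steps described above (similarity matrix, symmetric factorization, per-cluster selection) can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions: a bag-of-words cosine similarity stands in for the sentence-level semantic analysis; the factorization W ≈ HHᵀ with H ≥ 0 is fitted with a commonly used damped multiplicative update (the abstract does not say which optimizer is used); and the "most informative" sentence in a cluster is taken to be the one with the largest total similarity to the other members. The function names `cosine_similarity_matrix`, `symmetric_nmf`, and `summarize` are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(sentences):
    """Sentence-sentence similarity matrix W.

    Assumption: bag-of-words cosine similarity is only a placeholder for
    the semantic analysis the abstract refers to.
    """
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(sentences), len(vocab)))
    for row, s in enumerate(sentences):
        for w in s.lower().split():
            X[row, index[w]] += 1.0
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1e-12)  # unit-length rows
    return X @ X.T  # symmetric and non-negative


def symmetric_nmf(W, k, iters=200, beta=0.5, seed=0):
    """Approximate W by H @ H.T with H >= 0.

    Assumption: damped multiplicative update
    H <- H * (1 - beta + beta * (W H) / (H H^T H)).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[0], k)) + 1e-3  # strictly positive start
    for _ in range(iters):
        numer = W @ H
        denom = H @ (H.T @ H) + 1e-12  # guard against division by zero
        H *= (1.0 - beta) + beta * numer / denom
    return H


def summarize(sentences, k):
    """Cluster sentences into at most k groups and pick one from each."""
    W = cosine_similarity_matrix(sentences)
    H = symmetric_nmf(W, k)
    labels = H.argmax(axis=1)  # cluster = column with the largest weight
    picked = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue  # a cluster may end up empty
        # Hypothetical informativeness score: total within-cluster similarity.
        scores = W[np.ix_(members, members)].sum(axis=1)
        picked.append(int(members[scores.argmax()]))
    return [sentences[i] for i in sorted(picked)]


if __name__ == "__main__":
    docs = [
        "the storm flooded the coastal town",
        "flood waters from the storm covered the town",
        "the election results were announced today",
        "officials announced the results of the election",
    ]
    for sentence in summarize(docs, k=2):
        print(sentence)
```

Reading the cluster label off the largest entry in each row of H is one simple design choice; it relies on the non-negativity of H, which lets each column be interpreted as a soft cluster membership.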
