Aligning Sentences from Standard Wikipedia to Simple Wikipedia

This work improves monolingual sentence alignment for text simplification, specifically for text in standard and simple Wikipedia. We introduce a method that improves over past efforts by using a greedy (vs. ordered) search over the document and a word-level semantic similarity score based on Wiktionary (vs. WordNet) that also accounts for structural similarity through syntactic dependencies. Experiments show improved performance on a hand-aligned set, with the largest gain coming from structural similarity. Resulting datasets of manually and automatically aligned sentence pairs are made available.

[1]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[2]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.

[3]  Duncan J. Watts,et al.  Collective dynamics of ‘small-world’ networks , 1998, Nature.

[4]  Chris Callison-Burch,et al.  Bootstrapping Parallel Corpora , 2003, ParallelTexts@NAACL-HLT.

[5]  Regina Barzilay,et al.  Sentence Alignment for Monolingual Comparable Corpora , 2003, EMNLP.

[6]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[7]  Pascale Fung,et al.  Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and E , 2004, EMNLP.

[8]  Dániel Fogaras,et al.  Scaling link-based similarity search , 2005, WWW '05.

[9]  Dragos Stefan Munteanu,et al.  Improving Machine Translation Performance by Exploiting Non-Parallel Corpora , 2005, CL.

[10]  Christopher D. Manning,et al.  Generating Typed Dependency Parses from Phrase Structure Parses , 2006, LREC.

[11]  Stuart M. Shieber,et al.  Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora , 2006, EACL.

[12]  Mari Ostendorf,et al.  Text simplification for language learners: a corpus analysis , 2007, SLaTE.

[13]  Rada Mihalcea,et al.  Text-to-Text Semantic Similarity for Automatic Short Answer Grading , 2009, EACL.

[14]  Mari Ostendorf,et al.  Analysis of vocabulary difficulty using Wiktionary , 2009, SLaTE.

[15]  Iryna Gurevych,et al.  A Monolingual Tree-based Translation Model for Sentence Simplification , 2010, COLING.

[16]  Wen-tau Yih,et al.  Adaptive near-duplicate detection via similarity learning , 2010, SIGIR.

[17]  Cristian Danescu-Niculescu-Mizil,et al.  For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia , 2010, NAACL.

[18]  Mark Dredze,et al.  Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language , 2010, HLT-NAACL 2010.

[19]  Kristina Toutanova,et al.  Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment , 2010, NAACL.

[20]  David Kauchak,et al.  Simple English Wikipedia: A New Text Simplification Task , 2011, ACL.

[21]  Mirella Lapata,et al.  WikiSimple: Automatic Simplification of Wikipedia Articles , 2011, AAAI.

[22]  Emiel Krahmer,et al.  Sentence Simplification by Monolingual Machine Translation , 2012, ACL.

[23]  Ali Farhadi,et al.  Semantic Understanding of Professional Soccer Commentaries , 2012, UAI.

[24]  Mari Ostendorf,et al.  Graph-based algorithms for lexical semantics and its applications , 2012 .

[25]  Weiwei Guo,et al.  Modeling Sentences in the Latent Space , 2012, ACL.

[26]  David Kauchak,et al.  Improving Text Simplification Language Modeling Using Unsimplified Text Data , 2013, ACL.

[27]  Luke S. Zettlemoyer,et al.  Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves , 2013, EMNLP.

[28]  Chris Callison-Burch,et al.  PPDB: The Paraphrase Database , 2013, NAACL.

[29]  Ali Farhadi,et al.  Multi-Resolution Language Grounding with Weak Supervision , 2014, EMNLP.

[30]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[31]  Oren Etzioni,et al.  Learning to Solve Arithmetic Word Problems with Verb Categorization , 2014, EMNLP.

[32]  Ali Farhadi,et al.  Discriminative and consistent similarities in instance-level Multiple Instance Learning , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).