Paper Information - Enriching Multilingual Language Resources by Discovering Missing Cross-Language Links in Wikipedia

We present a novel method for discovering missing cross-language links between English and Japanese Wikipedia articles. We first collect candidate missing cross-language links, that is, pairs of English and Japanese Wikipedia articles that could be connected by a cross-language link. We then select the correct cross-language links from among the candidates using a classifier trained on various types of features. Our method has three desirable characteristics for discovering missing links. First, it discovers cross-language links with high accuracy (92% precision at 78% recall). Second, the features used by the classifier are language-independent. Third, the features are generated from resources obtained automatically from Wikipedia, without relying on any external knowledge. In this work, we discover approximately $10^5$ missing cross-language links in Wikipedia, almost two-thirds as many as the cross-language links that already exist in Wikipedia.
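The abstract reports precision and recall separately and does not state a combined score. As a rough illustration only (the F1 value below is derived here from the two reported figures, not quoted from the paper, and `f1_score` is a hypothetical helper name), the harmonic mean of the two can be computed as follows:

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract.
precision = 0.92
recall = 0.78

# Derived value, not a number stated in the paper.
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.844"
```

Under these inputs the combined score comes out at roughly 0.84, which is one way to summarize the reported precision/recall trade-off in a single number.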
