Mining Tibetan-Chinese bilingual entities from Wikipedia

Entity translation pairs play an important role in NLP applications such as cross-language information retrieval and machine translation. Named entities and domain entities are key factors affecting the performance of such systems. However, entity translations can rarely be found in existing bilingual dictionaries or parallel corpora. Tibetan Wikipedia contains many Tibetan neologisms and named entities, and this paper proposes a new method for automatically mining Tibetan-Chinese bilingual entity translations from Wikipedia based on language interlinks and page features. We construct an extraction pattern for the Tibetan-Chinese entity translation pairs obtained from previous work, and adopt a multi-feature selection model to discriminate among candidate translation pairs. The results show that the entity translation mining method can achieve high accuracy.
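
As a rough illustration of the interlink idea only, and not the paper's actual pipeline, the sketch below collects candidate title pairs from legacy inline interlanguage links of the form `[[zh:Title]]` in page wikitext. The function name `extract_pairs`, the `pages` dictionary, and the sample entry are assumptions made for this example; the page-feature patterns and the multi-feature selection model described above are not reproduced.

```python
import re

# Legacy inline interlanguage links look like [[zh:Title]].
# Namespace links such as [[Category:...]] start with a capital letter
# and are therefore not matched by this lowercase-only language code.
INTERLINK = re.compile(r"\[\[\s*([a-z][a-z\-]*)\s*:\s*([^\[\]|]+?)\s*\]\]")


def extract_pairs(pages, target_lang="zh"):
    """Return (source_title, target_title) candidates from interlinks.

    `pages` is a hypothetical mapping from a source-language page title
    to its raw wikitext; only the first link to `target_lang` is kept.
    """
    pairs = []
    for title, wikitext in pages.items():
        for lang, target in INTERLINK.findall(wikitext):
            if lang == target_lang:
                pairs.append((title, target))
                break
    return pairs


# Illustrative toy input, not data from the paper.
sample_pages = {
    "པེ་ཅིན།": "... [[en:Beijing]] [[zh:北京市]] [[Category:Example]]",
    "no-link-page": "a page without interlanguage links",
}
candidates = extract_pairs(sample_pages)
```

Pairs gathered this way are only candidates; a filtering or ranking step would still be needed before treating them as translations.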
