An ensemble of transliteration models for information retrieval

Transliteration renders proper names and technical terms phonetically in another script, especially from languages written in the Roman alphabet into languages written in non-Roman scripts, such as from English into Korean, Japanese, and Chinese. Because transliterations often serve as representative index terms for documents, handling them properly is important for an effective information retrieval (IR) system. However, handling transliterations by dictionary lookup is limited, because transliterations are usually not registered in the dictionary. For this reason, many researchers have tried to overcome the problem with machine transliteration. In this paper, we propose a method for improving machine transliteration using an ensemble of three different transliteration models. Because a single transliteration model is limited in its ability to reflect all possible transliteration behaviors, several models should be used in a complementary manner to achieve a high-performance machine transliteration system. This paper describes how transliterations are produced with the several machine transliteration models and how they are ranked using web data and the relevance scores given by each transliteration model. We report evaluation results for our ensemble transliteration model and experimental results for its impact on IR effectiveness. Machine transliteration tests on English-to-Korean and English-to-Japanese transliteration show that our proposed method achieves 78-80% word accuracy. Information retrieval tests on the KTSET and NTCIR-1 test collections show that our transliteration model can improve the performance of an information retrieval system by about 10-34%.
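The produce-then-rank idea described above can be illustrated with a minimal sketch. The abstract does not give the actual scoring formula, so everything specific here is an assumption made for illustration: the function name `rank_candidates`, the linear interpolation weight `alpha`, the averaging of per-model relevance scores, and the example scores and web hit counts are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def rank_candidates(model_outputs, web_counts, alpha=0.5):
    """Pool candidates from several models and rank them.

    model_outputs: one dict per model, mapping candidate -> relevance score.
    web_counts:    dict mapping candidate -> web hit count (hypothetical data).
    alpha:         hypothetical weight between model score and web evidence.
    """
    # Average each candidate's relevance score over all models;
    # a model that did not produce the candidate contributes 0.
    model_score = defaultdict(float)
    for output in model_outputs:
        for candidate, score in output.items():
            model_score[candidate] += score / len(model_outputs)

    # Normalize web hit counts over the pooled candidate set.
    total_hits = sum(web_counts.get(c, 0) for c in model_score)

    ranked = []
    for candidate, score in model_score.items():
        web = web_counts.get(candidate, 0) / total_hits if total_hits else 0.0
        ranked.append((candidate, alpha * score + (1 - alpha) * web))

    # Highest combined score first; ties broken by candidate string.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Illustrative (made-up) scores for Korean transliterations of "data".
outputs = [
    {"데이터": 0.6, "데이타": 0.4},
    {"데이터": 0.5, "다타": 0.5},
    {"데이타": 0.7, "데이터": 0.3},
]
hits = {"데이터": 900, "데이타": 100}

for candidate, score in rank_candidates(outputs, hits):
    print(candidate, round(score, 3))
```

In this sketch a candidate proposed by only one model can still rank highly if the web evidence supports it, which is one plausible way for the models to complement each other. The paper's own combination rule may differ.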
