Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora

This paper investigates the potential for projecting linguistic annotations, including part-of-speech tags and base noun phrase bracketings, from one language to another via automatically word-aligned parallel corpora. First, experiments assess the accuracy of unmodified direct transfer of tags and brackets from the source language (English) to the target languages (French and Chinese), both for noisy machine-aligned sentences and for clean hand-aligned sentences. Performance is then substantially boosted over both of these baselines by using training techniques optimized for very noisy data, yielding 94-96% core French part-of-speech tag accuracy and 90% French bracketing F-measure for stand-alone monolingual tools trained without any human-annotated data in the target language.
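
To make the "unmodified direct transfer" baseline concrete, the following is a minimal sketch of projecting part-of-speech tags through a word alignment. It is an illustration under stated assumptions, not the paper's implementation: the function name `project_tags`, the `(source_index, target_index)` alignment format, the majority-vote handling of many-to-one links, and the `UNK` placeholder for unaligned target words are all hypothetical choices made here.

```python
from collections import Counter, defaultdict


def project_tags(source_tags, alignment, target_len, unaligned_tag="UNK"):
    """Copy source-side POS tags onto target words via alignment links.

    source_tags: list of tags, one per source word.
    alignment:   iterable of (source_index, target_index) pairs
                 (hypothetical format, assumed for this sketch).
    target_len:  number of words in the target sentence.
    """
    # Collect the source tags that each target position is linked to.
    votes = defaultdict(Counter)
    for s, t in alignment:
        if 0 <= s < len(source_tags) and 0 <= t < target_len:
            votes[t][source_tags[s]] += 1

    projected = []
    for t in range(target_len):
        if votes[t]:
            # Many-to-one links: take the most frequent linked tag
            # (an assumption of this sketch, not taken from the paper).
            projected.append(votes[t].most_common(1)[0][0])
        else:
            # Target words with no alignment link get a placeholder tag.
            projected.append(unaligned_tag)
    return projected


# English "the red house" -> French "la maison rouge"
english_tags = ["DT", "JJ", "NN"]
links = [(0, 0), (1, 2), (2, 1)]
french_tags = project_tags(english_tags, links, target_len=3)
```

Tags produced this way inherit every alignment error, which is why the abstract treats direct transfer only as a baseline and relies on noise-robust training to reach the reported accuracies.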
