Paper information - Statistical pattern recognition approaches for retrieval-based machine translation systems

This dissertation addresses Machine Translation (MT): the automated translation, by a computer, of a document written in one language (the source language) into another (the target language). MT requires several kinds of knowledge about both the source and the target language, e.g., linguistic rules and linguistic exceptions. Traditional MT systems rely on an extensive parsing strategy to decode the linguistic rules and use a knowledge base to encode the linguistic exceptions. However, constructing the knowledge base becomes increasingly difficult as the translation system grows. To overcome this difficulty, real translation examples are used instead of a manually crafted knowledge base, a design strategy known as the Example-Based Machine Translation (EBMT) principle. Traditional EBMT systems use a database of word or phrase translation pairs, and their main challenge is combining these word- or phrase-level translation units into a meaningful and fluent target text. This study proposes a novel Retrieval-Based Machine Translation (RBMT) system that uses a sentence-level translation unit. The advantage of a sentence-level unit is that the sentence boundary is explicitly defined and the meaning is precise in both the source and the target language. The main challenge is limited coverage, i.e., the difficulty of finding an exact match between a user query and the sentences in the source database. Using an electronic dictionary and a topic modeling procedure, we develop a procedure that generates a cluster of sensible variations for each example in the source database. The coverage of our MT system improves because an input query is matched against a cluster of sensible variations of each translation example rather than against the original source example alone.
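
The coverage argument above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the example pairs, the `SYNONYMS` table (standing in for the electronic dictionary), the word-overlap score, and the function names are all assumptions made for illustration, and the topic modeling step that would filter out non-sensible variations is omitted.

```python
from itertools import product

# Hypothetical toy database: source sentences paired with target translations.
EXAMPLES = [
    ("where is the train station", "ou est la gare"),
    ("i would like a cup of coffee", "je voudrais une tasse de cafe"),
]

# Hypothetical synonym table standing in for the electronic dictionary.
SYNONYMS = {
    "train": ["railway"],
    "like": ["want"],
    "cup": ["mug"],
}


def variations(sentence):
    """Return the cluster of sentences obtained by word-level synonym substitution."""
    options = [[word] + SYNONYMS.get(word, []) for word in sentence.split()]
    return {" ".join(choice) for choice in product(*options)}


def overlap(a, b):
    """Jaccard overlap between the word sets of two sentences (assumed match score)."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)


def translate(query):
    """Return the target whose variation cluster best matches the query, with its score."""
    best_score, best_target = -1.0, None
    for source, target in EXAMPLES:
        # Match against every variation in the cluster, not only the original source.
        score = max(overlap(query, v) for v in variations(source))
        if score > best_score:
            best_score, best_target = score, target
    return best_target, best_score
```

In this toy setting, the query "where is the railway station" has no exact match among the original source sentences, but it does match one member of the first example's variation cluster exactly, which is the sense in which clustering is meant to widen coverage.
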
In addition, pattern recognition techniques are used to improve the matching procedure, specifically through the design of optimal pattern classifiers and the incorporation of subjective judgments. A high-performance statistical pattern classifier identifies the target sentences for an input query sentence in our MT system. The proposed classifier differs from a conventional classifier in how it addresses generalization capability. A conventional classifier addresses generalization through the parsimony principle and therefore risks choosing an oversimplified statistical model. The proposed classifier addresses generalization directly in terms of the training (empirical) data. It is expected to generalize better than conventional classifiers because, given the available training data, it is less likely to use an oversimplified statistical model. We further improve the matching procedure by incorporating subjective judgments. We formulate a novel cost function that combines subjective judgments with the degree of matching between translation examples and an input query, and we provide an optimization strategy for this cost function so that the statistical model can be optimized according to the subjective judgments.
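
The abstract does not give the form of the cost function, so the following is only a hypothetical sketch of the general idea of tuning a matching score against human ratings. The logistic score, the squared-error cost, the gradient-descent optimizer, and the toy `PAIRS` data are all assumptions and should not be read as the dissertation's formulation.

```python
import math

# Hypothetical data: (features of a query/example pair, human rating in [0, 1]).
# The first feature is a constant bias term; the second is an assumed similarity value.
PAIRS = [
    ([1.0, 0.9], 1.0),
    ([1.0, 0.7], 0.8),
    ([1.0, 0.3], 0.2),
    ([1.0, 0.1], 0.0),
]


def match_score(weights, features):
    """Logistic matching score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def cost(weights, pairs):
    """Mean squared gap between matching scores and subjective ratings."""
    return sum((match_score(weights, f) - r) ** 2 for f, r in pairs) / len(pairs)


def fit(pairs, steps=2000, rate=0.5):
    """Minimize the cost with plain gradient descent, starting from zero weights."""
    weights = [0.0] * len(pairs[0][0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for features, rating in pairs:
            s = match_score(weights, features)
            # Chain rule: d/dz of (s - r)^2 is 2 (s - r) s (1 - s).
            common = 2.0 * (s - rating) * s * (1.0 - s) / len(pairs)
            for i, x in enumerate(features):
                grad[i] += common * x
        weights = [w - rate * g for w, g in zip(weights, grad)]
    return weights
```

After fitting, pairs that human raters scored highly should receive higher matching scores than pairs they scored poorly, which is the behavior the subjective-judgment term is intended to encourage.
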

Dwi Sianto Mansjur | Biing-Hwang Juang | B. Juang
