PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Novel statistical approaches to text classification, machine translation and computer-assisted translation
Jorge Civera
(2008) PhD thesis, Universidad Politécnica de Valencia.

Abstract

This thesis presents diverse contributions to text classification, machine translation and computer-assisted translation within the statistical framework. In text classification, a new application called bilingual text classification is proposed, together with a series of models that capture bilingual information. Two main approaches are presented for this purpose: the first is based on a naive crosslingual-independence assumption, and the second on a more sophisticated crosslingual word-correlation framework. Under the naive assumption, five unigram models and smoothed n-gram language models are introduced. These models were evaluated on three tasks of increasing complexity, the most complex of which was considered from the viewpoint of a bilingual machine-aided indexing application. The crosslingual word-correlation framework is represented by bilingual models that integrate a translation model; in our case, this is the well-known M1 translation model in conjunction with a unigram model. This model was tested on the two simpler tasks mentioned above, where it surpassed the naive approach. In machine translation, the statistical word-alignment translation models M1, M2 and HMM are extended under the mixture modelling approach in order to define context-specific translation models. For the M2 model, a mixture extension of an existing iterative dynamic-programming search algorithm is also defined. This search algorithm allows us to assess directly the translation quality of the M2 mixture model on a semi-artificial controlled task, where it obtains statistically significant improvements over the conventional M2 model. Moreover, an extensive experimental evaluation of these three models is carried out on two well-known shared tasks.
These two tasks are used to assess, on the one hand, the quality of the alignments obtained as a byproduct of the M1, M2 and HMM models and, on the other hand, the translation quality of a statistical phrase-based system seeded with these alignments. In this evaluation, the M2 mixture model offered a statistically significant improvement in alignment quality over the conventional M2 model. In addition, the evaluation of translation quality revealed slight but systematic improvements for all three models, with the HMM mixture model achieving state-of-the-art results. Finally, an interactive and predictive computer-assisted translation system based on stochastic finite-state transducers is presented. This system integrates well-known, efficient error-correcting and n-best parsing algorithms, which are adapted and implemented to guarantee low response time while preserving adequate translation quality. The system was automatically tested on two corpora, one consisting of technical user manuals and the other of bulletins of the European Union. The former corpus served as a test bed for a thorough manual evaluation performed by translation agencies involved in the European project TransType2. Both automatic and manual evaluations reported a significant reduction in typing effort, which speeds up the translation process and thus achieves the final goal of computer-assisted translation systems.

EPrint Type: Thesis (PhD)
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing
ID Code: 4479
Deposited By: Jorge Civera
Deposited On: 13 March 2009