PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Advances in Fully-Automatic and Interactive Phrase-Based Statistical Machine Translation
Daniel Ortiz-Martínez, Ismael García-Varea and Francisco Casacuberta
(2011) PhD thesis, Universidad Politecnica de Valencia.


This thesis presents contributions to fully-automatic and interactive statistical machine translation. Statistical machine translation poses three problems: the modelling problem, the training problem and the search problem. This thesis presents contributions to all three. Regarding the modelling problem, an alternative derivation of phrase-based statistical translation models is proposed. This derivation introduces a set of statistical submodels governing different aspects of the translation process, and the resulting submodels can be introduced as components of a log-linear model. Regarding the training problem, we propose an alternative estimation technique for phrase-based models that tries to reduce the strong heuristic component of the standard estimation technique. The proposed technique treats the phrase pairs that compose the phrase model as part of complete bisegmentations of the source and target sentences. We demonstrate theoretically and empirically that the proposed estimation technique can be executed efficiently. Experimental results obtained with the open-source THOT toolkit, which is also presented in this thesis, show that the alternative estimation technique yields phrase models with lower perplexity than those obtained with the standard estimation technique. However, the reduction in perplexity did not translate into improvements in translation quality. To deal with the search problem, we propose a search algorithm based on the branch-and-bound paradigm. The proposed algorithm generalises different search strategies, which can be selected by modifying the input parameters. We carried out experiments to evaluate the performance of the proposed search algorithm.
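
As a rough illustration of what it means to use submodels as components of a log-linear model, the sketch below scores candidate translations by a weighted sum of log-domain feature values and keeps the highest-scoring one. It shows only the generic log-linear combination, not the thesis's specific derivation; the feature names, weights and candidate sentences are hypothetical and chosen purely for illustration.

```python
import math

def log_linear_score(features, weights):
    """Weighted sum of feature values: sum_m lambda_m * h_m(f, e)."""
    return sum(weights[name] * value for name, value in features.items())

def best_translation(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: log_linear_score(c["features"], weights))

# Hypothetical weights (lambda_m), one per submodel.
weights = {"phrase_model": 1.0, "language_model": 0.5, "word_penalty": -0.1}

# Hypothetical candidates; probabilities are made up and used in log form.
candidates = [
    {
        "text": "the house is small",
        "features": {
            "phrase_model": math.log(0.4),
            "language_model": math.log(0.2),
            "word_penalty": 4,  # number of target words
        },
    },
    {
        "text": "the home is little",
        "features": {
            "phrase_model": math.log(0.3),
            "language_model": math.log(0.05),
            "word_penalty": 4,
        },
    },
]

best = best_translation(candidates, weights)
print(best["text"])
```

In a real system the weights would typically be tuned on held-out data and the candidates would come from the search procedure rather than from a fixed list.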

EPrint Type: Thesis (PhD)
Subjects: Natural Language Processing
ID Code: 8797
Deposited By: Alfons Juan
Deposited On: 21 February 2012