PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Statistical approaches for natural language modelling and monotone statistical machine translation
Jesús Andrés-Ferrer, Alfons Juan and Francisco Casacuberta
(2010) PhD thesis, Universidad Politécnica de Valencia.

Abstract

This thesis gathers several contributions to statistical pattern recognition and, more specifically, to several natural language processing (NLP) tasks. Several well-known statistical techniques are revisited: parameter estimation, loss function design and probability modelling. These techniques are applied to NLP tasks such as text classification (TC), language modelling (LM) and statistical machine translation (SMT). In parameter estimation, we tackle the smoothing problem by proposing a constrained domain maximum likelihood estimation (CDMLE) technique. CDMLE avoids the need for the smoothing stage, which causes maximum likelihood estimation (MLE) to lose its good theoretical properties. This technique is applied to text classification by means of the Naive Bayes classifier. Afterwards, the CDMLE technique is extended to leaving-one-out MLE and then applied to LM smoothing. The results obtained in several LM tasks show an improvement in perplexity over standard smoothing techniques. Concerning the loss function, we carefully study the design of loss functions other than the 0-1 loss. We focus on those loss functions that, while retaining a decoding complexity similar to that of the 0-1 loss function, provide more flexibility. Many candidate loss functions are presented and analysed in several statistical machine translation tasks and for several translation models. We also analyse some notable translation rules, such as the direct translation rule, and we give further insight into log-linear models, which are, in fact, particular cases of loss functions. Finally, several monotone translation models are proposed based on well-known modelling techniques. Firstly, an extension to the GIATI technique is proposed to infer finite state transducers (FST). Afterwards, a phrase-based monotone translation model inspired by hidden Markov models is proposed. Lastly, a phrase-based hidden semi-Markov model is introduced. The latter model produces slight improvements over the baseline under some circumstances.

EPrint Type: Thesis (PhD)
Subjects: Computational, Information-Theoretic Learning with Statistics; Learning/Statistics & Optimisation; Natural Language Processing
ID Code: 5578
Deposited By: Alfons Juan
Deposited On: 08 March 2010