Statistical approaches for natural language modelling and monotone statistical machine translation
Jesús Andrés-Ferrer, Alfons Juan and Francisco Casacuberta
PhD thesis, Universidad Politécnica de Valencia.
This thesis gathers some contributions to statistical pattern
recognition and, more specifically, to several natural language
processing (NLP) tasks. Several well-known statistical techniques are
revisited in this thesis: parameter estimation, loss function
design and probability modelling. These techniques are applied
to several NLP tasks such as text classification (TC), language
modelling (LM) and statistical machine translation (SMT).
In parameter estimation, we tackle the smoothing problem by proposing
a constrained domain maximum likelihood estimation (CDMLE) technique.
The CDMLE avoids the smoothing stage that makes
maximum likelihood estimation (MLE) lose its good theoretical
properties. This technique is applied to text classification by means
of the Naive Bayes classifier. Afterwards, the CDMLE technique is
extended to leaving-one-out MLE and, then, applied to LM
smoothing. The results obtained in several LM tasks show an
improvement in terms of perplexity compared with standard
smoothing techniques.
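To illustrate why plain MLE needs a smoothing stage in the first place, the following toy sketch (with a made-up training sample) shows that unsmoothed relative-frequency estimates assign zero probability to any document containing an unseen word, which collapses the whole naive Bayes class score:

```python
from collections import Counter

def mle_probs(tokens):
    """Unsmoothed MLE: relative frequencies over the observed tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical training data for one class.
train = ["the", "match", "ended", "in", "a", "draw", "the", "team"]
p = mle_probs(train)

# A test document with one unseen word gets probability exactly zero,
# no matter how likely its other words are.
doc = ["the", "referee"]          # "referee" was never observed
score = 1.0
for w in doc:
    score *= p.get(w, 0.0)
print(score)                       # 0.0
```

Smoothing (or, in the thesis, constraining the estimation domain so that zero estimates never arise) is what prevents this degenerate behaviour.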
Concerning the loss function, we carefully study the design of loss
functions different from the 0-1 loss. We focus our study on
those loss functions that, while retaining a decoding
complexity similar to that of the 0-1 loss function, provide more flexibility.
Many candidate loss functions are presented and analysed in several
statistical machine translation tasks and for several translation
models. We also analyse some outstanding translation rules, such as
the direct translation rule, and we give further insight into
log-linear models, which are, in fact, particular cases of decision
rules derived from alternative loss functions.
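The role of the loss function in the decision rule can be sketched with a minimum-Bayes-risk decision over a toy candidate set (the posterior probabilities and the character-level loss below are made up for illustration): under the 0-1 loss the optimal decision is simply the MAP hypothesis, while a loss that measures similarity between strings can prefer a hypothesis that is close to all likely references rather than the single most probable one.

```python
# Hypothetical posterior over three candidate translations.
posterior = {"house": 0.35, "hose": 0.33, "hole": 0.32}

def zero_one_loss(y, y_true):
    return 0.0 if y == y_true else 1.0

def char_loss(y, y_true):
    """Toy string loss: positional character mismatches plus length gap."""
    return sum(a != b for a, b in zip(y, y_true)) + abs(len(y) - len(y_true))

def mbr_decision(posterior, loss):
    """Pick the hypothesis minimising expected loss under the posterior."""
    def risk(y):
        return sum(loss(y, yt) * p for yt, p in posterior.items())
    return min(posterior, key=risk)

print(mbr_decision(posterior, zero_one_loss))  # "house" (the MAP hypothesis)
print(mbr_decision(posterior, char_loss))      # "hose"
```

With the 0-1 loss the decision is "house"; with the character loss it is "hose", which is cheap against both other likely candidates. The decoding cost is the same in both cases, which is the kind of flexibility-at-equal-complexity trade-off studied in this thesis.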
Finally, several monotone translation models are proposed based on
well-known modelling techniques. Firstly, an extension to the GIATI
technique is proposed to infer finite state transducers
(FST). Afterwards, a phrase-based monotone translation model inspired
by hidden Markov models is proposed. Lastly, a phrase-based hidden
semi-Markov model is introduced. The latter model yields slight
improvements over the baseline under some circumstances.
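The essence of monotone phrase-based decoding can be sketched as a simple dynamic program over left-to-right segmentations of the source sentence; the phrase table and probabilities below are invented for illustration and are not the models proposed in the thesis:

```python
# Hypothetical phrase table: source phrase -> list of (target, probability).
PHRASE_TABLE = {
    ("la", "casa"): [("the house", 0.7)],
    ("la",): [("the", 0.9)],
    ("casa",): [("house", 0.8), ("home", 0.2)],
    ("verde",): [("green", 0.9)],
}

def decode_monotone(source, table, max_len=3):
    """DP over prefixes: best[i] holds the best (score, target words)
    covering source[:i] with a monotone segmentation."""
    best = {0: (1.0, [])}
    for i in range(1, len(source) + 1):
        for j in range(max(0, i - max_len), i):
            if j not in best:
                continue
            phrase = tuple(source[j:i])
            for target, p in table.get(phrase, []):
                score = best[j][0] * p
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [target])
    score, words = best[len(source)]
    return " ".join(words), score
```

For the Spanish input `["la", "casa", "verde"]` this yields "the house green": the target order follows the source order, which is exactly the reordering limitation that monotone models accept in exchange for efficient decoding.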