Learning Machine Translation in High-Dimensional Spaces
A review of representative studies on statistical machine translation shows that the mainstream techniques in this field are still essentially probabilistic. Their main drawbacks are the sensitivity of unreliable probability estimates to round-off errors and a lack of flexibility in handling high-dimensional features. This thesis investigates the application of purely discriminative training methods, which have proven successful in a broad range of natural language processing tasks, to the problem of statistical machine translation. The first attempt kernelizes the training process by modeling translation as a linear mapping between source and target word chunks (n-grams of various lengths). This formulation yields a regression problem with vector outputs. A kernelized ridge regression model and a one-class classifier called maximum margin regression are compared, and the former is shown to perform better on this task. For the large-scale training problem, two solutions are proposed: one based on blockwise matrix operations, and one that approximates the regression hyperplane locally and linearly by selecting a subset of relevant training examples online. Because of the computational complexity of the ridge regression model, the latter is more practical for applying the proposed method to real-world translation tasks. In addition, we introduce a novel way to integrate language models into this machine translation framework: the language model is used as a penalty term in the objective function of the regression model, since its n-gram representation exactly matches the definition of our feature vectors. An alternative solution proposed here is a novel structured classification model that explicitly handles high-dimensional discriminative joint feature vectors.
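As a rough illustration of the regression formulation described above, the following sketch maps source sentences to target n-gram feature vectors with kernel ridge regression. It is a minimal example under stated assumptions, not the implementation used in the thesis: the kernel is assumed to be a plain linear kernel over n-gram counts, the helper names (`ngrams`, `featurize`, `fit_predict`) and the toy sentence pairs are hypothetical, and the decoding step that turns predicted n-gram features back into a sentence is omitted.

```python
import numpy as np
from collections import Counter

def ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def featurize(sentences, vocab=None, max_n=2):
    """Map sentences to rows of n-gram counts; build the vocabulary if none is given."""
    counts = [Counter(ngrams(s.split(), max_n)) for s in sentences]
    if vocab is None:
        vocab = sorted({g for c in counts for g in c})
    index = {g: j for j, g in enumerate(vocab)}
    X = np.zeros((len(sentences), len(vocab)))
    for i, c in enumerate(counts):
        for g, v in c.items():
            if g in index:          # n-grams unseen in training are ignored
                X[i, index[g]] = v
    return X, vocab

def fit_predict(src_train, tgt_train, src_test, lam=0.1):
    """Kernel ridge regression with vector outputs (assumed linear n-gram kernel)."""
    X, src_vocab = featurize(src_train)
    Y, tgt_vocab = featurize(tgt_train)
    X_test, _ = featurize(src_test, vocab=src_vocab)
    K = X @ X.T                     # kernel matrix between training sources
    k_test = X @ X_test.T           # kernel values between training and test sources
    # Dual solution: prediction = Y^T (K + lam * I)^{-1} k(x)
    alpha = np.linalg.solve(K + lam * np.eye(len(src_train)), k_test)
    return (Y.T @ alpha).T, tgt_vocab

# Toy parallel data, purely illustrative.
src = ["la maison", "la voiture", "une maison"]
tgt = ["the house", "the car", "a house"]
scores, tgt_vocab = fit_predict(src, tgt, ["une voiture"])
# Rank target n-grams by predicted feature value.
ranked = sorted(zip(scores[0], tgt_vocab), reverse=True)
print(ranked[:4])
```

The linear system has one row per training sentence, so a direct solve costs roughly cubic time in the training set size, which is the kind of cost that motivates the blockwise and local-subset strategies mentioned in the text.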
The support vector machine style formulation of structured classification problems is slightly modified by regularizing the weight vector to be learned with the L1-norm. This yields a linear program and can be regarded as an approximate large-margin classifier. We argue that such a linear program can be solved iteratively using the column generation technique. In particular, if the extragradient method for linear programming is used, further efficiency can be achieved by properly selecting the starting point in each iteration based on the column generation. Compared with previous L2-regularized formulations, this method not only scales better, by trading off effectiveness against efficiency, but also has the advantage of accepting more complex structures. Finally, a generalization error bound is derived for the proposed model.
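To make concrete why L1 regularization turns the large-margin problem into a linear program, the sketch below writes the weight vector as a difference of two non-negative vectors, w = u - v, and hands the resulting program to a generic solver. It is an illustration under several assumptions rather than the method of the thesis: the output space is a small set of class labels instead of a translation structure, the joint feature map `joint_features` is a hypothetical block encoding, all margin constraints are enumerated up front, and SciPy's HiGHS solver stands in for the column generation and extragradient procedure, which is not implemented here.

```python
import numpy as np
from scipy.optimize import linprog

def joint_features(x, y, n_classes):
    """Hypothetical joint feature map: copy x into the block belonging to label y."""
    phi = np.zeros(n_classes * len(x))
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi

def train_l1_margin(X, labels, n_classes, C=10.0):
    """Minimize ||w||_1 + C * sum(xi) subject to margin constraints, as an LP."""
    n, d = len(X), n_classes * X.shape[1]
    # One constraint per (example, wrong label) pair:
    #   w . (phi(x_i, y_i) - phi(x_i, y)) >= 1 - xi_i
    rows, owners = [], []
    for i, (x, yi) in enumerate(zip(X, labels)):
        for y in range(n_classes):
            if y != yi:
                rows.append(joint_features(x, yi, n_classes)
                            - joint_features(x, y, n_classes))
                owners.append(i)
    D = np.array(rows)
    m = len(rows)
    S = np.zeros((m, n))            # S[k, i] = 1 if constraint k uses slack xi_i
    S[np.arange(m), owners] = 1.0
    # Variables z = [u, v, xi], all non-negative, with w = u - v.
    # At the optimum sum(u) + sum(v) equals ||w||_1.
    c = np.concatenate([np.ones(2 * d), C * np.ones(n)])
    A_ub = np.hstack([-D, D, -S])   # -(D u - D v) - S xi <= -1
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:d] - res.x[d:2 * d]

def predict(w, x, n_classes):
    """Return the label whose joint feature vector scores highest under w."""
    return max(range(n_classes), key=lambda y: w @ joint_features(x, y, n_classes))

# Toy, linearly separable data, purely illustrative.
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
labels = [0, 1, 2]
w = train_l1_margin(X, labels, n_classes=3)
print([predict(w, x, 3) for x in X])
```

In a structured setting the number of wrong outputs per example is far too large to enumerate, which is why an iterative scheme that adds only the most violated constraints at each round becomes necessary.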