PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Bilingual sentence matching using kernel CCA
Abhishek Tripathi, Arto Klami and Sami Virpioja
In: 2010 IEEE International Workshop on Machine Learning for Signal Processing, 29 Aug - 01 Sep 2010, Kittilä, Finland.


Matching samples between two data sets is a fundamental task in unsupervised learning. In this paper we propose an algorithm that uses the statistical dependency between the data sets to solve the matching problem in the general case where the samples in the two data sets have different feature representations. As a concrete example, we consider sentence-level alignment of a parallel corpus based on monolingual data. Multilingual text collections with sentence-level alignment are required by statistical machine translation methods. We show how statistical dependencies between the feature representations of partially aligned corpora (e.g., aligned at the paragraph level) can be used to learn sentence-level alignment in a data-driven way. Our novel matching algorithm, based on Kernel Canonical Correlation Analysis (KCCA), outperforms an earlier algorithm that uses linear CCA.
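
The general idea of matching through a shared canonical space can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the procedure, so everything here is an assumption made for illustration. The sketch assumes a small seed set of known pairs for fitting (the paper works from partially aligned corpora instead), uses one common regularised form of the KCCA eigenproblem, runs on synthetic data rather than sentence features, and picks the regularisation value 0.1 arbitrarily. All function names are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.optimize import linear_sum_assignment


def linear_kernel(A, B):
    return A @ B.T


def rbf_kernel(A, B, gamma=0.5):
    # Gaussian kernel; gamma is an illustrative value, not a tuned one.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def center_kernel(K_train, K_new=None):
    """Center a kernel matrix in feature space using training-set means."""
    if K_new is None:
        K_new = K_train
    col_means = K_train.mean(axis=0, keepdims=True)  # 1 x n
    row_means = K_new.mean(axis=1, keepdims=True)    # m x 1
    return K_new - col_means - row_means + K_train.mean()


def fit_kcca(Kx, Ky, n_components=3, reg=0.1):
    """Regularised KCCA as a symmetric generalised eigenproblem (assumed form)."""
    n = Kx.shape[0]
    Rx = Kx + reg * np.eye(n)
    Ry = Ky + reg * np.eye(n)
    Z = np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Rx @ Rx, Z], [Z, Ry @ Ry]])
    _, vecs = eigh(A, B)               # eigenvalues come back in ascending order
    top = vecs[:, -n_components:]      # keep the most correlated directions
    return top[:n], top[n:]            # dual coefficients for each view


def match(U, V):
    """One-to-one matching that minimises total squared distance."""
    cost = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    _, cols = linear_sum_assignment(cost)
    return cols                        # cols[i] is the row of V matched to U[i]


def demo(seed=0, n_train=100, n_test=20, noise=0.01, kernel=linear_kernel):
    """Synthetic two-view data sharing a 3-d latent signal (illustration only)."""
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    latent = rng.normal(size=(n, 3))
    mix_x = np.linalg.qr(rng.normal(size=(5, 3)))[0].T  # 3 x 5 mixing matrix
    mix_y = np.linalg.qr(rng.normal(size=(4, 3)))[0].T  # 3 x 4 mixing matrix
    X = latent @ mix_x + noise * rng.normal(size=(n, 5))
    Y = latent @ mix_y + noise * rng.normal(size=(n, 4))
    Xtr, Ytr, Xte, Yte = X[:n_train], Y[:n_train], X[n_train:], Y[n_train:]

    perm = rng.permutation(n_test)     # hide the true pairing of the test set
    Yte_shuffled = Yte[perm]

    Kx_raw, Ky_raw = kernel(Xtr, Xtr), kernel(Ytr, Ytr)
    alpha, beta = fit_kcca(center_kernel(Kx_raw), center_kernel(Ky_raw))

    # Project both unmatched sets into the shared canonical space.
    U = center_kernel(Kx_raw, kernel(Xte, Xtr)) @ alpha
    V = center_kernel(Ky_raw, kernel(Yte_shuffled, Ytr)) @ beta

    predicted = match(U, V)
    truth = np.argsort(perm)           # inverse permutation is the true matching
    return predicted, truth


if __name__ == "__main__":
    predicted, truth = demo()
    print("matching accuracy:", (predicted == truth).mean())
```

The demo uses the linear kernel so that the synthetic example stays easy to reason about; passing `kernel=rbf_kernel` is where the nonlinear behaviour of KCCA would come from, at the cost of having to choose a kernel width. The two views have different dimensionalities (5 and 4), which mirrors the setting where the two data sets have different feature representations.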

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Natural Language Processing; Theory & Algorithms
ID Code: 7608
Deposited By: Arto Klami
Deposited On: 17 March 2011