Sparse CCA for Bilingual Word Generation
Canonical Correlation Analysis (CCA), proposed by Hotelling (1936), is a technique for finding pairs of basis vectors that maximise the correlation between a set of paired variables. The paired variables can be considered as two views of the same object, a perspective we adopt throughout. A disadvantage of CCA and similar statistical methods is that the learned projections are linear combinations of all the features in the primal representation or in the dual representation, which makes the solutions difficult to interpret. Studies by Zou et al. (2004) and Moghaddam et al. (2006) have addressed this issue for Principal Component Analysis (PCA) by learning only the relevant features that maximise the variance. We introduce a new convex least-squares variant of CCA, Sparse CCA (SCCA), which seeks a semantic projection that uses as few relevant features as possible to explain as much correlation as possible. In previous studies, CCA has been formulated either in the primal or in the dual (kernel) representation for both views. These formulations, coupled with the need for sparsity, can prove insufficient when one desires, or is limited to, a mixed representation, for example when one wishes to learn the correlation of words in one language with documents in another. We address these scenarios by formulating SCCA in a primal-dual framework, in which one view is represented in the primal and the other in the dual (kernel-defined) representation. We then transform the initial formulation into a semi-supervised method by representing the second view as the vector of inner products between the query and the training documents, searching for a sparse primal view whose correlation with this dual-view train-test inner-product vector is maximised. The method is demonstrated on paired English-French and English-Spanish corpora for a word generation task.
In this task we generate a sparse set of words in the paired language such that the correlation is maximised. We compare the proposed method against conventional kernel CCA (KCCA) on the bilingual word generation task and observe an increase in performance of 20%-25% in the average percentage. The main advantage of the proposed method is that it does not need the full semantic information to produce the sparse weight representations. In turn, the sparse representation automatically determines the number of words used as the retrieved set, whereas KCCA requires a heuristic threshold.
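The contrast between threshold-based and sparsity-based retrieval can be sketched as follows; the vocabulary, weight values, and threshold are entirely hypothetical toy numbers, chosen only to show that a sparse weight vector defines the retrieved word set by its nonzero support, while dense weights need a cut-off.

```python
import numpy as np

# Hypothetical projection weights over a small target-language vocabulary.
vocab = ["maison", "chat", "livre", "rouge", "ville", "pain"]
dense_w = np.array([0.41, 0.02, 0.35, 0.01, 0.20, 0.01])   # KCCA-style dense weights
sparse_w = np.array([0.45, 0.00, 0.38, 0.00, 0.17, 0.00])  # SCCA-style sparse weights

# KCCA: a heuristic threshold must decide how many words to retrieve.
threshold = 0.1  # arbitrary choice; results change with it
kcca_words = [w for w, v in zip(vocab, dense_w) if v > threshold]

# Sparse weights: the nonzero support itself is the retrieved set.
scca_words = [w for w, v in zip(vocab, sparse_w) if v != 0.0]

print(kcca_words)  # depends on the chosen threshold
print(scca_words)  # determined automatically by the sparsity pattern
```

With these toy numbers the two sets happen to coincide, but moving the KCCA threshold changes its retrieved set, whereas the sparse set is fixed by the learned weights alone.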