PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Similarity Word-Sequence Kernels for Sentence Clustering
Jesús Andrés-Ferrer, Germán Sanchis-Trilles and Francisco Casacuberta
In: 8th International Workshop on Statistical Pattern Recognition, August 18-20, Cesme, Izmir, Turkey.


In this paper, we present a novel clustering approach that uses kernels as similarity functions within the C-means algorithm. Several word-sequence kernels are defined and extended so that they satisfy the properties of similarity functions. These monolingual word-sequence kernels are then extended to bilingual word-sequence kernels and applied to the tasks of monolingual and bilingual sentence clustering. The motivation for this proposal is to group similar sentences into clusters so that a specialised model can be trained for each cluster, thereby reducing both the size and the complexity of the initial task. We provide empirical evidence that the use of bilingual kernels can lead to better clusters, as measured by intra-cluster perplexity.
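
The general idea of clustering sentences with a kernel used as a similarity function can be illustrated with a minimal sketch. The abstract does not give the kernel definitions, so everything below is an assumption made for illustration: a plain bigram-count kernel stands in for the paper's word-sequence kernels, it is normalised to the range [0, 1], and the C-means loop uses the standard kernel-space distance to a cluster mean. The names `ngram_kernel`, `similarity` and `kernel_c_means` are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import sqrt


def ngram_counts(sentence, n=2):
    """Count the contiguous word n-grams of a whitespace-tokenised sentence."""
    words = sentence.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def ngram_kernel(a, b, n=2):
    """Assumed stand-in kernel: inner product of the n-gram count vectors."""
    counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
    return sum(count * counts_b[gram] for gram, count in counts_a.items())


def similarity(a, b, n=2):
    """Normalised kernel in [0, 1]; 0.0 if either sentence has no n-grams."""
    denom = sqrt(ngram_kernel(a, a, n) * ngram_kernel(b, b, n))
    return ngram_kernel(a, b, n) / denom if denom else 0.0


def kernel_c_means(sentences, seeds, n=2, max_iter=20):
    """Cluster sentences; `seeds` holds the indices of the initial representatives."""
    size = len(sentences)
    k = [[similarity(a, b, n) for b in sentences] for a in sentences]

    def distance(i, members):
        # Squared distance, in the kernel's feature space, from item i
        # to the mean of the cluster formed by `members`.
        if not members:
            return float("inf")
        m = len(members)
        cross = sum(k[i][j] for j in members) / m
        within = sum(k[j][l] for j in members for l in members) / (m * m)
        return k[i][i] - 2 * cross + within

    # Initial assignment: each sentence joins its most similar seed.
    labels = [max(range(len(seeds)), key=lambda c: k[i][seeds[c]])
              for i in range(size)]
    for _ in range(max_iter):
        clusters = [[i for i in range(size) if labels[i] == c]
                    for c in range(len(seeds))]
        new_labels = [min(range(len(seeds)), key=lambda c: distance(i, clusters[c]))
                      for i in range(size)]
        if new_labels == labels:  # converged: no sentence changed cluster
            break
        labels = new_labels
    return labels


sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
labels = kernel_c_means(sentences, seeds=[0, 2])
```

In this sketch a bilingual variant could be obtained by combining one such kernel per language for each sentence pair, for example by summing them; that combination is likewise an assumption and not necessarily the construction used in the paper.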

EPrint Type: Conference or Workshop Item (Talk)
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing; Information Retrieval & Textual Information Access
ID Code: 7429
Deposited By: Alfons Juan
Deposited On: 17 March 2011