PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Loose Phrase String Kernels
Janez Brank
In: SiKDD 2006, 09 Oct 2006, Ljubljana, Slovenia.


When textual documents are represented by feature vectors for further processing (e.g. for categorization, clustering, or visualization), one possible representation is based on “loose phrases” (also known as “proximity features”). This is a generalization of n-grams: a loose phrase is considered to appear in a document if all the words of the phrase occur sufficiently close to each other. We describe a kernel that corresponds to the dot product of documents under a loose phrase representation. This kernel can be plugged into any kernel method so that it operates on documents in the loose phrase representation instead of the bag-of-words representation.
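
To make the idea concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes that a loose phrase is an unordered set of n distinct words that all fall within some run of `window` consecutive tokens, and that each feature is a binary indicator; the abstract fixes neither choice. The function names `loose_phrases` and `loose_phrase_kernel` and both parameters are hypothetical. The sketch enumerates features explicitly, which a practical kernel would presumably avoid.

```python
from itertools import combinations


def loose_phrases(tokens, n=2, window=3):
    """Collect loose phrases under the assumed definition.

    A phrase is a frozenset of n distinct words that co-occur inside
    at least one run of `window` consecutive tokens.
    """
    phrases = set()
    for start in range(len(tokens)):
        # Distinct words in the window that begins at this position.
        span = sorted(set(tokens[start:start + window]))
        for combo in combinations(span, n):
            phrases.add(frozenset(combo))
    return phrases


def loose_phrase_kernel(doc_a, doc_b, n=2, window=3):
    """Dot product of binary loose-phrase indicator vectors.

    With 0/1 features, the dot product equals the number of loose
    phrases that the two documents share.
    """
    shared = loose_phrases(doc_a, n, window) & loose_phrases(doc_b, n, window)
    return len(shared)


doc_a = "the quick brown fox".split()
doc_b = "the brown quick dog".split()
print(loose_phrase_kernel(doc_a, doc_b, n=2, window=3))
```

For these two toy documents the sketch prints 3: the pairs {the, quick}, {the, brown} and {quick, brown} occur within three tokens of each other in both. With `window=2` only {quick, brown} is shared, which shows how the proximity threshold controls how far the representation relaxes strict n-grams.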

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code: 2421
Deposited By: Blaz Fortuna
Deposited On: 22 November 2006