Loose Phrase String Kernels

Abstract

When representing textual documents by feature vectors for the purposes of further processing (e.g. for categorization, clustering, or visualization), one possible representation is based on “loose phrases” (also known as “proximity features”). This is a generalization of n-grams: a loose phrase is considered to appear in a document if all the words from the phrase occur sufficiently close to each other. We describe a kernel that corresponds to the dot product of documents under a loose phrase representation. This kernel can be plugged into any kernel method to deal with documents in the loose phrase representation instead of the bag of words representation.
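To make the representation concrete, the following is a minimal sketch of the dot product it induces, computed by explicitly enumerating features. It is not the kernel algorithm the abstract refers to, which is not given here. Several choices are assumptions made only for illustration: a loose phrase is treated as an unordered set of `n` distinct words, "sufficiently close" means that all of them fall within `window` consecutive tokens, features are binary (present or absent), and the names `loose_phrases` and `loose_phrase_kernel` are hypothetical.

```python
from itertools import combinations


def loose_phrases(tokens, n=2, window=3):
    """Return the set of n-word loose phrases occurring in `tokens`.

    Assumption (not from the source): a phrase is an unordered set of n
    distinct words, and it occurs if all n words appear inside some run
    of `window` consecutive tokens.
    """
    found = set()
    for start in range(len(tokens)):
        # Distinct words in the window beginning at `start`, in sorted
        # order so that each phrase has one canonical tuple form.
        span = sorted(set(tokens[start:start + window]))
        found.update(combinations(span, n))
    return found


def loose_phrase_kernel(doc_a, doc_b, n=2, window=3):
    """Dot product of binary loose-phrase feature vectors.

    With 0/1 features, the dot product is simply the number of loose
    phrases that the two documents share.
    """
    phrases_a = loose_phrases(doc_a.lower().split(), n, window)
    phrases_b = loose_phrases(doc_b.lower().split(), n, window)
    return len(phrases_a & phrases_b)


a = "the quick brown fox"
b = "the brown quick dog"
print(loose_phrase_kernel(a, b, n=2, window=3))  # 3
print(loose_phrase_kernel(a, b, n=2, window=2))  # 1
```

With `window=3` the two documents share the phrases {brown, quick}, {brown, the}, and {quick, the}. Shrinking the window to 2 reduces the features to unordered adjacent word pairs, and only {brown, quick} is shared. Explicit enumeration grows quickly with `n` and `window`, which is one reason to compute the same dot product through a kernel instead.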