Loose Phrase String Kernels
In: SiKDD 2006, 09 Oct 2006, Ljubljana, Slovenia.
When representing textual documents by feature
vectors for the purposes of further processing (e.g.
for categorization, clustering, or visualization), one
possible representation is based on “loose phrases”
(also known as “proximity features”). This is a generalization
of n-grams: a loose phrase is considered
to appear in a document if all the words from the
phrase occur sufficiently close to each other. We
describe a kernel that corresponds to the dot product
of documents under a loose-phrase representation.
This kernel can be plugged into any kernel
method, allowing it to work with documents in the
loose-phrase representation instead of the
bag-of-words representation.
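
As a rough illustration of the representation, the following Python sketch enumerates loose phrases explicitly and takes the dot product of the resulting feature vectors. It is not the kernel computation this paper describes, only a naive baseline under stated assumptions: a loose phrase is taken to be an unordered set of at most max_len distinct words, "sufficiently close" is taken to mean that all of its words fall within a window of `window` consecutive tokens, and features are binary (present or absent). The function names and both parameters are hypothetical.

```python
from itertools import combinations


def loose_phrases(tokens, max_len=2, window=3):
    """Return the set of loose phrases found in a token list.

    Assumption: a phrase is an unordered set of up to max_len distinct
    words, and it occurs if all its words lie inside some run of
    `window` consecutive tokens.
    """
    found = set()
    for start in range(len(tokens)):
        # Distinct words inside the window that begins at this position.
        span = sorted(set(tokens[start:start + window]))
        for size in range(1, max_len + 1):
            for combo in combinations(span, size):
                found.add(frozenset(combo))
    return found


def loose_phrase_kernel(doc_a, doc_b, max_len=2, window=3):
    """Dot product of binary loose-phrase vectors (number of shared phrases)."""
    a = loose_phrases(doc_a.lower().split(), max_len, window)
    b = loose_phrases(doc_b.lower().split(), max_len, window)
    return len(a & b)


if __name__ == "__main__":
    print(loose_phrase_kernel("the quick brown fox", "the brown quick dog"))
```

Because phrases are stored as unordered sets, "quick brown" and "brown quick" map to the same feature, which is what distinguishes this representation from ordinary n-grams. Explicit enumeration grows quickly with max_len and window, so the sketch is only practical for short documents.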