Finding surprising patterns in textual data streams
Tristan Snowsill, Florent Nicart, Marco Stefani, Tijl De Bie and Nello Cristianini
CIP2010: The 2nd International Workshop on Cognitive Information Processing
We address the task of detecting surprising patterns in large textual data streams. Such patterns can reveal events in the real world when the data streams are generated by online news media, emails, Twitter feeds, movie subtitles, scientific publications, and more. The volume of such text streams often exceeds human capacity for analysis, so automatic pattern recognition tools are indispensable.
In particular, we are interested in surprising changes in the frequency of n-grams of words, or more generally of symbols drawn from an alphabet of unbounded size. Although the number of possible n-grams grows exponentially with n, and the alphabet itself is unbounded, we show how such changes can be detected efficiently. To this end, we rely on a data structure known as a generalised suffix tree, annotated with a limited amount of statistical information. Crucially, we show how both the generalised suffix tree and these statistical annotations can be updated efficiently in an on-line fashion.
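To make the notion of a surprising frequency change concrete, the following is a minimal sketch in Python. It is not the method of the paper: it replaces the annotated generalised suffix tree with plain hash-table counting over a bounded n-gram length, and it assumes a simple smoothed binomial z-score as the measure of surprise. The function names (`count_ngrams`, `surprise_scores`) and the `smoothing` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokens, max_n):
    """Count every n-gram of length 1..max_n (naive stand-in for a suffix tree)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def surprise_scores(past_tokens, recent_tokens, max_n=2, smoothing=1.0):
    """Score each recent n-gram by how far its count departs from the past rate.

    Assumption: counts in the recent window are modelled as binomial with a
    rate estimated (with additive smoothing) from the past window.
    """
    past = count_ngrams(past_tokens, max_n)
    recent = count_ngrams(recent_tokens, max_n)
    scores = {}
    for gram, observed in recent.items():
        n = len(gram)
        past_total = max(len(past_tokens) - n + 1, 0)   # n-gram positions in the past
        recent_total = len(recent_tokens) - n + 1       # n-gram positions in the recent window
        # Smoothed past rate; strictly between 0 and 1, so std below is positive.
        p = (past[gram] + smoothing) / (past_total + 2 * smoothing)
        expected = p * recent_total
        std = sqrt(recent_total * p * (1 - p))
        scores[gram] = (observed - expected) / std
    return scores
```

Under these assumptions, an n-gram that was absent from the past window but recurs in the recent one receives a large positive score, while an n-gram occurring at its usual rate scores near zero. The naive counter stores every n-gram up to `max_n` explicitly, which is exactly the cost that a suffix-tree representation is meant to avoid.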