PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Topic detection and tracking in a stream of documents
Blaz Novak
(2008) Other thesis, University of Ljubljana.

Abstract

Recent developments in information technology have left people facing an overwhelming amount of information, with blogs presenting the latest and most abundant source of it. In this thesis, I approach the problem from the standpoint of organizing newly created information into sensible groups. The first part of the thesis surveys the state of the art in the areas relevant to the problem and analyzes the shortcomings of the different methods. The main contribution is a new algorithm that pieces together various ideas presented in the first part. It is an online hierarchical clustering algorithm capable of incremental model updates that support both the addition and the removal of documents. The structure of the model is adapted after each step to better reflect the structure of the currently observed world, and the model can also be optimized while waiting for new events. I tested the properties of the new algorithm in experiments on simulated data streams created from the Reuters Corpus Volume 1 dataset. I found that the basic assumptions about time complexity and the ability to adapt the model are correct and that the algorithm performs surprisingly well for a range of different inputs.

EPrint Type: Thesis (Other)
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing; Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code: 5000
Deposited By: Jan Rupnik
Deposited On: 24 March 2009