PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Visualization of text document corpus
Blaz Fortuna, Dunja Mladenić and Marko Grobelnik
In: SIKDD 2005 at multiconference IS 2005, 17 Oct 2005, Ljubljana, Slovenia.


From the point of view of automated text processing, natural language is highly redundant: many different words share a common or similar meaning. A computer cannot easily recognize this without some background knowledge. Latent Semantic Indexing (LSI) is a technique that helps extract some of this background knowledge from a corpus of text documents; it can also be viewed as the extraction of hidden semantic concepts from the documents. Visualization, in turn, can be very helpful in data analysis, for instance for finding the main topics that appear in larger sets of documents. Extracting the main concepts from documents with techniques such as LSI can make the resulting visualizations more useful. For example, given a set of descriptions of European research projects (6FP), one can find the main areas these projects cover, including semantic web, e-learning, security, etc. In this paper we describe a method for visualizing a document corpus based on LSI, present the system implementing it, and give results of using the system on several datasets.
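
As a rough illustration of the idea in the abstract, the following Python sketch projects a handful of short documents onto their two strongest latent concepts and uses those as plot coordinates. It is not the authors' system. The TF-IDF weighting, the power-iteration eigen-solver, the choice of the first two latent dimensions as x and y, and all function names (`tfidf_vectors`, `top_eigenpairs`, `lsi_layout`) are assumptions made for this example, and the four documents are invented.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in c.items()} for c in counts]


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def top_eigenpairs(gram, k, iterations=200, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    n = len(gram)
    work = [row[:] for row in gram]
    pairs = []
    for _ in range(k):
        # A fresh random start for every component avoids starting
        # orthogonal to the direction we are looking for.
        v = [rng.random() for _ in range(n)]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def lsi_layout(docs, k=2):
    """Coordinates of each document in the top-k latent concept space.

    If A is the term-document matrix with SVD A = U S V^T, then the
    document Gram matrix A^T A has eigenvectors V and eigenvalues S^2,
    so document i sits at (s_1 * V[i][0], ..., s_k * V[i][k-1]).
    """
    vectors = tfidf_vectors(docs)
    gram = [[dot(u, v) for v in vectors] for u in vectors]
    pairs = top_eigenpairs(gram, k)
    return [
        tuple(math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs)
        for i in range(len(docs))
    ]


# Invented example documents on two topics (assumption, not from the paper).
docs = [
    "semantic web ontology",
    "semantic web ontology reasoning",
    "network security encryption",
    "network security encryption protocol",
]
for doc, point in zip(docs, lsi_layout(docs)):
    print(point, doc)
```

Working on the small document-by-document Gram matrix keeps the sketch dependency-free. For a real corpus, a sparse truncated SVD from a numerical library would be the more practical choice.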

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: User Modelling for Computer Human Interaction
          Information Retrieval & Textual Information Access
ID Code: 1197
Deposited By: Blaz Fortuna
Deposited On: 24 November 2005