PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Discovering unexpected documents in corpora
François Jacquenet and Christine Largeron
Knowledge-Based Systems Volume 22, Number 6, pp. 421-429, 2009.

Abstract

Text mining is widely used to discover frequent patterns in large corpora of documents. Hence, many classical data mining techniques that have proven fruitful for data stored in relational databases are now successfully applied to textual data. Nevertheless, there are many situations where it is more valuable to discover unexpected information than frequent patterns. In technology watch, for example, we may want to discover new trends in specific markets or find out what competitors are planning in the near future. This paper addresses that research context. We propose several unexpectedness measures and have implemented them in a prototype, called UnexpectedMiner, that watchers can use to discover unexpected documents in large corpora (patents, datasheets, advertisements, scientific papers, etc.). UnexpectedMiner can take the structure of documents into account when discovering unexpected information. Many experiments have been performed to validate our measures and demonstrate the usefulness of our system.

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Information Retrieval & Textual Information Access
ID Code: 6730
Deposited By: François Jacquenet
Deposited On: 08 March 2010