Discovering unexpected documents in corpora
Text mining is widely used to discover frequent patterns in large document corpora. Many classical data mining techniques that have proven fruitful on data stored in relational databases are now successfully applied to textual data. Nevertheless, there are many situations where it is more valuable to discover unexpected information than frequent patterns. In technology watch, for example, we may want to discover new trends in specific markets or learn what competitors are planning in the near future. This paper addresses that research context. We propose several unexpectedness measures and have implemented them in a prototype called UnexpectedMiner, which watchers can use to discover unexpected documents in large corpora (patents, datasheets, advertisements, scientific papers, etc.). UnexpectedMiner can take the structure of documents into account when discovering unexpected information. Many experiments have been performed to validate our measures and demonstrate the usefulness of our system.