Summarization Techniques for Pattern Collections in Data Mining
PhD thesis, University of Helsinki.
Discovering patterns from data is an important task in data mining. Techniques exist for finding large collections of many kinds of patterns very efficiently. A collection of patterns can be regarded as a summary of the data. A major difficulty is that pattern collections that summarize the data well are often very large.
In this dissertation we describe methods for summarizing pattern collections to make them more understandable. More specifically, we focus on the following themes:
1) Quality value simplifications. We study simplifications of pattern collections based on simplifying the quality values of the patterns. In particular, we study simplification by discretization.
2) Pattern orderings. It is difficult to find a suitable trade-off between the accuracy of the representation and its size. As a solution to this problem, we suggest that patterns could be ordered in such a way that each prefix of the pattern ordering gives a good summary of the whole collection.
3) Pattern chains and antichains. Virtually all pattern collections have natural underlying partial orders. We exploit the partial orders over pattern collections by clustering the patterns into chains and antichains.
4) Change profiles. We describe how patterns can be related to each other by comparing how their quality values change with respect to their common neighborhoods, i.e., by comparing their change profiles.
5) Inverse pattern discovery. As the patterns are often used to summarize data, it is natural to ask whether the original data set can be deduced from the pattern collection. We study the computational complexity of such problems.
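As an illustration of theme 1, the following minimal sketch shows one way quality values could be discretized. It is not taken from the dissertation. It assumes the quality values are frequencies in [0, 1] and that the goal is to use few distinct values while keeping the maximum absolute error within a bound. The function name `discretize`, the parameter `eps`, and the example frequencies are hypothetical.

```python
def discretize(values, eps):
    """Map each value to a representative at most eps away (up to rounding).

    Greedy sweep over the sorted distinct values: a group is closed as soon
    as the next value would make its span exceed 2 * eps, and every member
    of the group is mapped to the midpoint of the group's span.
    """
    mapping = {}
    group = []

    def flush():
        # Representative is the midpoint of the smallest and largest member.
        rep = (group[0] + group[-1]) / 2
        for g in group:
            mapping[g] = rep

    for v in sorted(set(values)):
        if group and v - group[0] > 2 * eps:
            flush()
            group = []
        group.append(v)
    if group:
        flush()
    return mapping


# Hypothetical itemset frequencies, used only for illustration.
frequencies = {"A": 0.50, "B": 0.48, "AB": 0.30, "C": 0.29, "AC": 0.10}
mapping = discretize(frequencies.values(), eps=0.02)
simplified = {pattern: mapping[f] for pattern, f in frequencies.items()}
```

In this example the five original frequencies collapse to three distinct values, so the simplified collection needs fewer distinct quality values, at the cost of a bounded error in each one.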