Maximum Entropy Modelling for Assessing Results on Real-Valued Data
Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining in particular, such trivial results may be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting results. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL-based histogram estimator, and we incorporate them into our model using well-founded convex optimisation techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Importantly, the model is efficient and parameter-free, and it naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation.
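To make the empirical hypothesis testing step concrete, the following minimal Python sketch computes a one-sided empirical p-value from datasets sampled under a background model. It is an illustration under stated assumptions, not the method of this paper: `toy_sampler` is a hypothetical stand-in that draws i.i.d. uniform values rather than sampling from a fitted Maximum Entropy model, and `submatrix_mean` is a hypothetical test statistic chosen only for the example.

```python
import random


def empirical_p_value(observed_stat, null_stats):
    """One-sided empirical p-value with the usual +1 correction.

    Counts how many statistics from sampled datasets are at least as
    extreme as the observed one.
    """
    exceed = sum(1 for s in null_stats if s >= observed_stat)
    return (exceed + 1) / (len(null_stats) + 1)


def submatrix_mean(data, rows, cols):
    """Hypothetical test statistic: mean value inside a sub-rectangle."""
    values = [data[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


def toy_sampler(n_rows, n_cols, rng):
    """Stand-in background model (assumption): i.i.d. uniform values.

    A real application would sample from the fitted background model.
    """
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


# Build a toy "observed" dataset with an artificially elevated sub-rectangle.
rng = random.Random(0)
rows, cols = range(5), range(3)
observed = toy_sampler(20, 10, rng)
for r in rows:
    for c in cols:
        observed[r][c] += 0.5

# Compare the observed statistic against statistics of sampled datasets.
obs_stat = submatrix_mean(observed, rows, cols)
null_stats = [
    submatrix_mean(toy_sampler(20, 10, rng), rows, cols) for _ in range(200)
]
p = empirical_p_value(obs_stat, null_stats)
print(f"empirical p-value: {p:.4f}")
```

A small p-value here would suggest that the structure is unlikely to arise under the stand-in background model alone. With a background model that respects the row and column marginal distributions, the same comparison would indicate whether a structure can be explained by those marginals.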