Statistical asymptotic and non-asymptotic consistency of Bayesian networks: convergence to the right structure and consistent probability estimates
Sylvain Gelly and Olivier Teytaud
Bayesian networks are a well-known and powerful tool for representing and reasoning under uncertainty. One can refer to [PEA 00], [NAI 04] for a general introduction to Bayesian networks. Learning the structure and the parameters of Bayesian networks can be done through either expert information or data. Here, we only address the problem of learning from data, i.e. learning a probability distribution given a set of examples drawn from this distribution. Although many algorithms exist for learning Bayesian networks from data, several problems remain. Furthermore, the use of learning theory for Bayesian networks is still far from complete.
First, when looking for a Bayesian model, one can have different goals, e.g. i) evaluating some probabilities qualitatively; ii) evaluating expectations (of gain or loss). In the first case, evaluating a risk is roughly the question: does a given event happen with probability or ? Then the use of logarithms, leading to maximum likelihood, is justified. In the second case, if we look for the expectation of $f$ (the vector of possible values indexed by the possible states), the approximation of the real probability vector $P$ by a probability vector $Q$ leads to an error bounded (thanks to the Cauchy-Schwarz inequality) by $\|P - Q\|_2 \, \|f\|_2$. Therefore, optimizing a criterion that is monotonic as a function of $\|P - Q\|_2$ is the natural approach.
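The bound in the second case can be written out in one line. The following is a sketch under stated notation, not a statement taken from the paper: $f$ denotes the vector of values indexed by the possible states, $P$ the real probability vector and $Q$ its approximation. The symbol names $f$, $P$ and $Q$ are chosen here for illustration.

```latex
\[
\bigl|\, \mathbb{E}_P[f] - \mathbb{E}_Q[f] \,\bigr|
  = \Bigl|\, \sum_{i} \bigl(P(i) - Q(i)\bigr)\, f(i) \,\Bigr|
  \le \Bigl( \sum_{i} \bigl(P(i) - Q(i)\bigr)^{2} \Bigr)^{1/2}
      \Bigl( \sum_{i} f(i)^{2} \Bigr)^{1/2}
  = \|P - Q\|_2 \, \|f\|_2 .
\]
```

The inequality is the Cauchy-Schwarz inequality applied to the vectors $P - Q$ and $f$. Since $\|f\|_2$ does not depend on the model, any criterion that decreases with $\|P - Q\|_2$ tightens this bound on the error of the estimated expectation.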
Second, when the postulated structure is not the right one, maximum likelihood (the frequency approach to probability estimation) leads to very unstable results. We therefore propose a non-standard and tractable loss function for Bayesian networks, together with evidence of its relevance.
Furthermore, the purpose of this paper is to provide some theoretical insights into the problems of learning Bayesian networks. The use of statistical learning theory provides bounds on the number of examples needed to approximate the distribution for a given precision and confidence, depending upon some complexity measures; using covering numbers, we show the influence of structural entropy, as a refinement of scores based on the number of parameters only. We also provide, among other things, an algorithm which is guaranteed to converge to an optimal (in size) structure as the number of i.i.d. examples goes to infinity.
We also compare the form of our bound with the forms of the different scores classically used in Bayesian network structure learning.
The paper is organized as follows: in section 2 we present an overview of our most concrete results. In section 3 we briefly survey some classical ways to learn Bayesian networks from data and discuss the contribution of this paper with regard to existing results. In section 4 we formally introduce the problem and the notations. Section 5 first recalls some classical results of learning theory and presents our results on the evaluation of VC-dimensions and covering numbers. We then generalize our results to more general Bayesian networks, with hidden variables, in section 5.3. Section 6 shows useful corollaries applied to structure learning, parameter learning, universal consistency, and others. Section 7 presents algorithmic details. Section 8 presents empirical results.