## Abstract

Casting the solution of a known Markov Decision Problem as probabilistic inference on a related graphical model has the benefit of opening the field of MDPs to recent developments in approximate inference. In this paper we extend this framework to the reinforcement learning problem, in which the transition and reward distributions are not given and must be learned through interaction with the environment. A naive approach is to use a point estimate of the transition model. However, this does not reflect the uncertainty in the model of the environment and, as such, one cannot expect such a naive approach to form policies that maintain a degree of exploration. Instead, we suggest a Bayesian solution that maintains a posterior distribution over transition models, which enables us to take account of the uncertainty in our knowledge of the transition model when planning. The resulting EM algorithm is formally intractable, and we discuss two approximate solution methods, one based on variational Bayes and the other on expectation propagation.
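As a small illustration of the contrast the abstract draws, the sketch below compares a point estimate of the transition distribution for a single (state, action) pair with a Dirichlet posterior over that distribution. This is a hypothetical example under assumed discrete dynamics and a uniform prior, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 possible next states for one (state, action) pair.
n_next_states = 3
prior = np.ones(n_next_states)   # assumed uniform Dirichlet(1, 1, 1) prior
counts = np.array([4, 1, 0])     # observed transitions to each next state

# Naive approach: a maximum-likelihood point estimate of the transition model.
point_estimate = counts / counts.sum()

# Bayesian alternative: the Dirichlet posterior keeps model uncertainty explicit.
posterior = prior + counts
posterior_mean = posterior / posterior.sum()

# Transition models sampled from the posterior still assign mass to the
# never-observed third state, which the point estimate rules out entirely --
# the uncertainty that exploration-aware planning can exploit.
samples = rng.dirichlet(posterior, size=1000)
print(point_estimate)        # [0.8 0.2 0. ]
print(posterior_mean)        # smoothed toward the prior; all entries positive
print(samples[:, 2].mean())  # strictly positive: residual uncertainty
```

The point estimate assigns zero probability to the unseen transition, so a planner using it has no incentive to explore that outcome; the posterior does not collapse in this way.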