PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Solving deterministic policy (PO)MDPs using Expectation-Maximisation and Antifreeze
David Barber and Tom Furmston
European Conference on Machine Learning (LEMIR workshop), pp. 50-64, 2009.

Abstract

Solving Markov Decision Processes (MDPs) and their partially observable extension (POMDPs) is traditionally viewed as finding policies that maximise the expected reward. We follow the rephrasing of this problem as learning in a related probabilistic model. Our trans-dimensional distribution formulation obtains results equivalent to previous work in the infinite horizon case and also rigorously handles the finite horizon case without discounting. In contrast to previous expositions, our framework elides auxiliary variables, simplifying the algorithm development. For any MDP the optimal policy is deterministic, so this important case needs to be dealt with explicitly. Whilst this case has been discussed by previous authors, their treatment has not been formally equivalent to an EM algorithm, but rather based on a fixed-point iteration analogous to policy iteration. In contrast, we derive a true EM approach for this case and show that it has a significantly faster convergence rate than non-deterministic EM. Our approach extends naturally to the POMDP case as well. In the special case of deterministic environments, standard EM algorithms break down, and we show how this can be addressed using a convex combination of the original deterministic environment and a fictitious stochastic `antifreeze' environment.
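As a rough illustration of the `antifreeze' idea in the abstract, the sketch below mixes a deterministic transition model with a fictitious stochastic one via a convex combination, so that every transition retains some probability mass. This is a minimal sketch only: the function name antifreeze_transitions, the mixing weight epsilon, and the choice of a uniform fictitious environment are assumptions made for illustration and are not taken from the paper.

import numpy as np

def antifreeze_transitions(P_det, epsilon=0.1):
    # P_det[a, s, s'] = p(s' | s, a), with 0/1 entries for a deterministic
    # environment; shape (n_actions, n_states, n_states).
    # Returns a convex combination with a fictitious uniform environment
    # (an assumption here; the paper's antifreeze environment may differ).
    n_actions, n_states, _ = P_det.shape
    P_uniform = np.full_like(P_det, 1.0 / n_states, dtype=float)
    return (1.0 - epsilon) * P_det + epsilon * P_uniform

# Example: a 3-state deterministic chain with a single action.
P_det = np.zeros((1, 3, 3))
P_det[0, 0, 1] = P_det[0, 1, 2] = P_det[0, 2, 2] = 1.0
P_mix = antifreeze_transitions(P_det, epsilon=0.05)
assert np.allclose(P_mix.sum(axis=-1), 1.0)  # rows remain valid distributions

Because the mixed model is strictly positive, the posterior quantities needed by an EM-style algorithm no longer degenerate, which is the role the abstract attributes to the antifreeze environment.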

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics
Learning/Statistics & Optimisation
Theory & Algorithms
ID Code: 6097
Deposited By: David Barber
Deposited On: 08 March 2010