PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Dynamic Policy Programming
Mohammad Gheshlaghi Azar and Bert Kappen
Journal of Machine Learning Research Volume 15, Number CP15, pp. 1-26, 2011.


In this paper, we consider the problem of planning and learning in infinite-horizon discounted-reward Markov decision problems. We propose a novel iterative direct policy-search approach, called dynamic policy programming (DPP). To the best of our knowledge, DPP is the first convergent direct policy-search method that uses a Bellman-like iteration technique and is, at the same time, compatible with function approximation. For the tabular case, we prove that DPP converges asymptotically to the optimal policy. We numerically compare the performance of DPP with other state-of-the-art approximate dynamic programming methods on the mountain-car problem, using linear function approximation with Gaussian basis functions. We observe that, unlike the other approximate dynamic programming methods, DPP converges to a near-optimal policy even when the basis functions are randomly placed. We conclude that DPP, combined with function approximation, asymptotically outperforms the other approximate dynamic programming methods in the mountain-car problem.

Keywords: Approximate dynamic programming, reinforcement learning, Markov decision processes, direct policy search, KL divergence.
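The Bellman-like iteration mentioned above can be illustrated with a minimal tabular sketch. It assumes a known finite MDP encoded as nested lists P[x][a][y] (transition probabilities) and R[x][a] (expected rewards), and it assumes the recursion Psi_{k+1}(x,a) = Psi_k(x,a) + r(x,a) + gamma * sum_y P(y|x,a) M Psi_k(y) - M Psi_k(x), where M Psi(x) is the average of Psi(x,.) under the Boltzmann policy pi(a|x) proportional to exp(eta * Psi(x,a)). This is our reading of a DPP-style update, not a transcription from the paper: the averaging operator, the zero initialization, and the helper names (dpp, m_eta, boltzmann, policy) are hypothetical, and the function-approximation variant used on mountain car is not covered.

```python
import math
from typing import List

# Hypothetical tabular MDP encoding (an assumption of this sketch):
#   P[x][a][y] = probability of moving from state x to y under action a
#   R[x][a]    = expected immediate reward for action a in state x
Table = List[List[float]]


def boltzmann(psi_x: List[float], eta: float) -> List[float]:
    """Softmax policy pi(a|x) proportional to exp(eta * Psi(x, a)).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    top = max(psi_x)
    weights = [math.exp(eta * (p - top)) for p in psi_x]
    total = sum(weights)
    return [w / total for w in weights]


def m_eta(psi_x: List[float], eta: float) -> float:
    """Boltzmann-weighted average of Psi(x, .) under the softmax policy."""
    return sum(p * q for p, q in zip(boltzmann(psi_x, eta), psi_x))


def dpp(P: List[Table], R: Table, gamma: float, eta: float, iters: int) -> Table:
    """Run the assumed DPP-style recursion from Psi_0 = 0; return Psi_iters.

    Psi_{k+1}(x, a) = Psi_k(x, a) + R[x][a]
                      + gamma * sum_y P[x][a][y] * M(Psi_k)(y) - M(Psi_k)(x)
    """
    n_states, n_actions = len(R), len(R[0])
    psi = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        # Boltzmann-averaged "value" of every state under the current Psi.
        m = [m_eta(psi[x], eta) for x in range(n_states)]
        psi = [
            [
                psi[x][a]
                + R[x][a]
                + gamma * sum(P[x][a][y] * m[y] for y in range(n_states))
                - m[x]
                for a in range(n_actions)
            ]
            for x in range(n_states)
        ]
    return psi


def policy(psi: Table, eta: float) -> Table:
    """Softmax policy pi_k(a|x) induced by Psi_k in every state."""
    return [boltzmann(row, eta) for row in psi]


if __name__ == "__main__":
    # Two states, two actions. In state 0, action 1 moves to state 1;
    # in state 1, action 0 stays there and pays reward 1.
    P = [
        [[1.0, 0.0], [0.0, 1.0]],  # state 0: stay, go to 1
        [[0.0, 1.0], [1.0, 0.0]],  # state 1: stay, go to 0
    ]
    R = [[0.0, 0.0], [1.0, 0.0]]
    psi = dpp(P, R, gamma=0.9, eta=1.0, iters=300)
    print(policy(psi, eta=1.0))
```

In this toy chain with gamma = 0.9 the optimal values are 9 (state 0) and 10 (state 1). In the sketch, the Boltzmann policy concentrates on moving to state 1 and then staying there, the averaged values approach those optimal values, and the Psi entries of suboptimal actions keep decreasing; it is the induced policy, not Psi itself, that settles.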

EPrint Type: Article
Subjects: Computational, Information-Theoretic Learning with Statistics; Machine Vision; Learning/Statistics & Optimisation
ID Code: 7043
Deposited By: Bert Kappen
Deposited On: 03 February 2011