PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming
Dimitri Bertsekas and Huizhen Yu
(2010) Technical Report. University of Helsinki.


We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm requires (possibly inexact) solution of a nonlinear system of equations, involving estimates of state costs as well as Q-factors. This is Bellman's equation for an optimal stopping problem that can be solved with simple Q-learning iterations, in the case where a lookup table representation is used; it can also be solved with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99], in the case where feature-based Q-factor approximations are used. In exact/lookup table representation form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead and/or more reliable convergence advantages over existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm resolves effectively the inherent difficulties of existing schemes due to inadequate exploration.
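
As a rough illustration of the lookup-table form described above, the following Python sketch alternates a policy improvement step with a few (inexact) iterations on a stopping-type equation that involves both state cost estimates J and Q-factors. The abstract does not state the equations, so the update used here, Q(i,u) = g(i,u) + alpha * sum_j p_ij(u) * min{J(j), Q(j, mu(j))}, is an assumption about the form of the stopping problem. The function names, the two-state toy problem, and the iteration counts are hypothetical, and the sketch is deterministic and synchronous rather than asynchronous or simulation-based.

```python
def optimal_q(P, g, alpha, sweeps=500):
    """Reference solution: plain value iteration on Q-factors."""
    n, m = len(P), len(P[0])
    Q = [[0.0] * m for _ in range(n)]
    for _ in range(sweeps):
        Q = [[g[i][u] + alpha * sum(P[i][u][j] * min(Q[j]) for j in range(n))
              for u in range(m)] for i in range(n)]
    return Q


def enhanced_policy_iteration(P, g, alpha, outer=200, inner=5):
    """Policy iteration-like sketch (assumed form, see lead-in).

    P[i][u][j]: transition probability, g[i][u]: expected one-stage cost.
    """
    n, m = len(P), len(P[0])
    Q = [[0.0] * m for _ in range(n)]
    J = [0.0] * n
    mu = [0] * n
    for _ in range(outer):
        # Policy improvement: J(i) = min_u Q(i,u), mu(i) = a minimizing u.
        for i in range(n):
            J[i] = min(Q[i])
            mu[i] = Q[i].index(J[i])
        # Inexact policy evaluation: a few iterations on the stopping-type
        # equation, where "stopping" at state j costs J(j) and "continuing"
        # costs Q(j, mu(j)).
        for _ in range(inner):
            Q = [[g[i][u] + alpha * sum(P[i][u][j] * min(J[j], Q[j][mu[j]])
                                        for j in range(n))
                  for u in range(m)] for i in range(n)]
    return Q, J, mu


# Hypothetical two-state, two-action problem (illustrative numbers only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
g = [[1.0, 2.0],
     [0.5, 3.0]]
alpha = 0.9

Q, J, mu = enhanced_policy_iteration(P, g, alpha)
Q_star = optimal_q(P, g, alpha)
```

On this toy problem the Q-factors returned by the sketch can be compared entry by entry with those from plain value iteration (Q_star); the number of inner iterations controls how exact each evaluation step is.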

EPrint Type: Monograph (Technical Report)
Additional Information: Also as LIDS report 2831, M.I.T.
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Theory & Algorithms
ID Code: 8068
Deposited By: Huizhen Yu
Deposited On: 17 March 2011