PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
András Antos, Csaba Szepesvári and Rémi Munos
In: COLT-06 (2006).

Abstract

We consider batch reinforcement learning problems in continuous-space, expected total discounted-reward Markovian Decision Problems. As opposed to previous theoretical work, we consider the case when the training data consists of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration, where in successive iterations the Q-functions of the intermediate policies are obtained by minimizing a novel Bellman-residual-type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance, where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.
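For intuition only, below is a minimal Python sketch of fitted policy iteration with Bellman-residual minimization over a single recorded trajectory. Everything in it is a hypothetical stand-in: the synthetic trajectory, the linear function class, and the helper names (features, greedy_action) are assumptions, and the sketch minimizes the plain empirical squared Bellman residual by least squares rather than the modified Bellman-residual error the paper actually introduces and analyses.

    import numpy as np

    # Hypothetical data: a single trajectory of (s, a, r, s') tuples collected
    # by some behaviour policy on a continuous-state, finite-action MDP.
    rng = np.random.default_rng(0)
    N, D, A, gamma = 500, 4, 3, 0.95          # samples, state dim, actions, discount
    states = rng.normal(size=(N, D))
    actions = rng.integers(A, size=N)
    rewards = rng.normal(size=N)
    next_states = rng.normal(size=(N, D))

    def features(s, a):
        """Assumed linear function class: one state block per action."""
        phi = np.zeros(D * A)
        phi[a * D:(a + 1) * D] = s
        return phi

    def greedy_action(w, s):
        """Greedy policy with respect to the current fitted Q."""
        return int(np.argmax([features(s, a) @ w for a in range(A)]))

    w = np.zeros(D * A)
    for _ in range(10):                        # policy-iteration sweeps
        # Policy evaluation: fit Q for the policy greedy w.r.t. the previous Q
        # by minimizing the (naive) empirical squared Bellman residual
        # sum_i (Q(s_i, a_i) - r_i - gamma * Q(s'_i, pi(s'_i)))^2,
        # which for a linear class is an ordinary least-squares problem.
        Phi = np.array([features(s, a) for s, a in zip(states, actions)])
        PhiNext = np.array([features(sn, greedy_action(w, sn)) for sn in next_states])
        X = Phi - gamma * PhiNext              # residual applied to features
        w, *_ = np.linalg.lstsq(X, rewards, rcond=None)

    print("fitted weights:", w)

Each sweep re-fits the weight vector to the Bellman residual of the policy that is greedy with respect to the previous fit. The paper replaces this naive squared residual with a modified Bellman-residual error precisely because the empirical squared residual is a biased objective when estimated from a single sample path.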

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics; Learning/Statistics & Optimisation; Theory & Algorithms
ID Code: 6355
Deposited By: Csaba Szepesvári
Deposited On: 08 March 2010