PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Near-optimal Regret Bounds for Reinforcement Learning
Peter Auer, Thomas Jaksch and Ronald Ortner
In: Advances in Neural Information Processing Systems 21. NIPS Foundation, 2009, pp. 89-96.


For undiscounted reinforcement learning in Markov decision processes (MDPs), we consider the total regret of a learning algorithm with respect to an optimal policy. To describe the transition structure of an MDP, we propose a new parameter: an MDP has diameter D if, for any pair of states s1, s2, there is a policy that moves from s1 to s2 in at most D steps (on average). We present a reinforcement learning algorithm with total regret ~O(DS sqrt(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Omega(sqrt(DSAT)) on the total regret of any learning algorithm.
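
To make the diameter parameter concrete, the following is a minimal sketch, not the authors' code, of how the smallest such D could be computed for a small, fully known MDP. It assumes a hypothetical transition layout in which `P[s][a]` is a list of `(next_state, probability)` pairs, and it uses plain value iteration on expected hitting times. The function names, the tolerance, and the toy MDP are all illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical layout: P[s][a] is a list of (next_state, probability) pairs.
Transitions = List[List[List[Tuple[int, float]]]]


def min_expected_steps(P: Transitions, target: int,
                       tol: float = 1e-10,
                       max_iter: int = 100_000) -> List[float]:
    """Smallest expected number of steps to reach `target` from each state."""
    h = [0.0] * len(P)
    for _ in range(max_iter):
        new_h = [0.0] * len(P)
        for s, actions in enumerate(P):
            if s == target:
                continue  # already at the target: zero further steps
            # One step now, plus the best expected remainder over all actions.
            new_h[s] = 1.0 + min(
                sum(p * h[t] for t, p in outcomes) for outcomes in actions
            )
        if max(abs(x - y) for x, y in zip(new_h, h)) < tol:
            return new_h
        h = new_h
    raise RuntimeError("no convergence; some state may be unreachable")


def diameter(P: Transitions) -> float:
    """Maximum, over ordered pairs of distinct states, of the minimal
    expected travel time between them."""
    return max(
        steps
        for target in range(len(P))
        for s, steps in enumerate(min_expected_steps(P, target))
        if s != target
    )


# Toy two-state MDP, made up for illustration: action 0 stays put,
# action 1 reaches the other state with probability 0.1.
two_state = [
    [[(0, 1.0)], [(1, 0.1), (0, 0.9)]],
    [[(1, 1.0)], [(0, 0.1), (1, 0.9)]],
]

# Deterministic three-state cycle 0 -> 1 -> 2 -> 0 with a single action.
cycle = [[[(1, 1.0)]], [[(2, 1.0)]], [[(0, 1.0)]]]

print(diameter(two_state))
print(diameter(cycle))
```

In the toy two-state MDP, the best policy keeps trying the move action, so the number of steps is geometric with success probability 0.1 and the computed diameter should be close to 1/0.1 = 10. In the deterministic cycle, the farthest state is two steps away. Note that the learning algorithm in the abstract works with an unknown MDP, so a computation like this is only an aid for understanding the parameter, not a step of the algorithm.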

EPrint Type: Book Section
Project Keyword: UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics; Learning/Statistics & Optimisation
ID Code: 4578
Deposited By: Thomas Jaksch
Deposited On: 13 March 2009