PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Near-optimal Regret Bounds for Reinforcement Learning
Peter Auer, Thomas Jaksch and Ronald Ortner
(2009) Technical Report. Chair for Informationtechnology, Leoben, Austria.


This technical report is an extended version of [1]. For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret ~O(DS sqrt(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Omega(sqrt(DSAT)) on the total regret of any learning algorithm.

PDF - Requires Adobe Acrobat Reader or other PDF viewer.
EPrint Type:Monograph (Technical Report)
Project Keyword:Project Keyword UNSPECIFIED
Subjects:Computational, Information-Theoretic Learning with Statistics
Learning/Statistics & Optimisation
ID Code:4586
Deposited By:Thomas Jaksch
Deposited On:13 March 2009