PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Near-optimal Regret Bounds for Reinforcement Learning
Peter Auer, Thomas Jaksch and Ronald Ortner
(2007) Working Paper, University of Leoben.


For undiscounted reinforcement learning, we consider the total regret of a learning algorithm with respect to an optimal policy. We present a reinforcement learning algorithm with total regret O(DS sqrt(AT)) after T steps for any unknown Markov decision process (MDP) with S states, A actions per state, and diameter D. The diameter of an MDP is at most D if, for any pair of states s1, s2, there is a policy that moves from s1 to s2 in at most D steps on average. Our upper bound holds with high probability, and it can be converted into a logarithmic regret bound if a fixed difference between the average reward of the optimal policy and that of the second-best policy is assumed. We also present a corresponding lower bound of Omega(sqrt(DSAT)) on the worst-case total regret of any learning algorithm.
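
The diameter definition above can be made concrete with a small sketch. The code below is not the algorithm from the paper; it only illustrates the quantity D, assuming the transition probabilities are known and stored as a nested list P[s][a][s2] (a hypothetical layout chosen for this example). For each target state it runs value iteration on the minimal expected hitting time, then takes the maximum over all ordered pairs of distinct states. The helper names (min_expected_hitting_times, diameter, regret_rates) and the toy two-state MDP are assumptions made for illustration, and regret_rates drops the constants hidden by O and Omega.

```python
import math

def min_expected_hitting_times(P, target, tol=1e-10, max_iter=100_000):
    """Minimal expected number of steps to reach `target` from every state.

    P[s][a][s2] is the probability of moving from s to s2 under action a
    (an assumed layout, not taken from the paper).
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(max_iter):
        # One step is always paid, then the best action's expected remainder.
        new_h = [
            0.0 if s == target
            else 1.0 + min(sum(p * h[s2] for s2, p in enumerate(row)) for row in P[s])
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_h, h)) < tol:
            return new_h
        h = new_h
    raise RuntimeError("no convergence; the target may be unreachable")

def diameter(P):
    """Largest minimal expected hitting time over ordered pairs of distinct states."""
    n = len(P)
    return max(
        min_expected_hitting_times(P, t)[s]
        for t in range(n) for s in range(n) if s != t
    )

def regret_rates(D, S, A, T):
    """Growth rates quoted in the abstract, with hidden constants dropped."""
    upper = D * S * math.sqrt(A * T)
    lower = math.sqrt(D * S * A * T)
    return upper, lower

# Toy MDP: action 0 stays put, action 1 switches state with probability 0.1.
P = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
D = diameter(P)  # expected to be close to 10 (= 1 / 0.1)
upper, lower = regret_rates(D, S=2, A=2, T=10_000)
print(D, upper, lower)
```

In this toy case the two rates differ by a factor of sqrt(DS), which is the gap between the upper and lower bounds stated in the abstract.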

EPrint Type: Other
Subjects: Computational, Information-Theoretic Learning with Statistics; Learning/Statistics & Optimisation
ID Code: 3174
Deposited By: Thomas Jaksch
Deposited On: 03 January 2008