Near-optimal Regret Bounds for Reinforcement Learning
Thomas Jaksch, Ronald Ortner and Peter Auer
Journal of Machine Learning Research 11 (2010)
For undiscounted reinforcement learning in Markov decision processes (MDPs), we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP, we propose a new parameter: An MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm is given as well.
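For concreteness, the diameter described informally above can be written as follows in the paper's notation, where T(s' | M, π, s) denotes the first time step in which state s' is reached when following policy π from state s in MDP M:

$$D(M) \;:=\; \max_{s \neq s'} \;\min_{\pi} \; \mathbb{E}\big[\, T(s' \mid M, \pi, s) \,\big].$$

Intuitively, D measures how long it takes, in expectation and under the best policy for that pair of states, to travel between the two most distant states; it is finite exactly when the MDP is communicating.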
These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T.
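In the paper's notation, the resulting logarithmic bound takes the following shape (stated here as a sketch, up to constants), where g denotes the gap in average reward ρ between an optimal policy and the best strictly suboptimal policy:

$$\mathrm{Regret}(T) \;=\; O\!\left(\frac{D^2 S^2 A \,\log T}{g}\right), \qquad g \;:=\; \rho^*(M) \;-\; \max_{\pi:\,\rho^\pi(M) < \rho^*(M)} \rho^\pi(M).$$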
Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of Õ(ℓ^{1/3}T^{2/3}DS√A).
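One restart schedule consistent with this rate, given here as a sketch rather than as the paper's exact choice, restarts the learning algorithm from scratch at steps

$$\tau_i \;=\; \left\lceil \frac{i^3}{\ell^2} \right\rceil, \qquad i = 1, 2, 3, \ldots$$

A quick check of the rate: by step T there are on the order of (ℓ²T)^{1/3} phases; summing the per-phase bound Õ(DS√(A · phase length)) over the phases in which the MDP does not change gives Õ(DS√A · ℓ^{1/3}T^{2/3}), while each of the ℓ changes can spoil at most one phase, whose length is O(ℓ^{-2/3}T^{2/3}), adding at most another O(ℓ^{1/3}T^{2/3}).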