## Abstract

We present an algorithm, Ucrl2, which we show to be nearly optimal by a new analysis of the "optimism in the face of uncertainty" paradigm. We consider undiscounted rewards and bound the regret, which is the sum of missed rewards (also during learning!) compared to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We provide the best known bounds for undiscounted reinforcement learning. The total regret of Ucrl2 is O(DS sqrt(AT log(T/delta))) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability and corresponds to a PAC-like bound of Omega(D^2 S^2 A/epsilon^2 log(DSA/(epsilon delta))) steps until the average per-step regret is at most epsilon. We also present a lower bound of Omega(sqrt(DSAT)) on the total regret of any learning algorithm. These new bounds demonstrate the utility of the diameter as a structural parameter of an MDP.
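To make the diameter concrete, the following is a minimal sketch (not part of Ucrl2 itself) that computes D for a small, fully known MDP. It assumes the transition probabilities are given as a NumPy array `P` of shape `(S, A, S)`, and it finds the minimal expected hitting time to each target state by value iteration on the corresponding shortest-path equations. The function names, the toy MDP, and the two helpers that evaluate the regret bounds without their constant factors are all illustrative assumptions.

```python
import math
import numpy as np


def min_expected_hitting_times(P, target, tol=1e-10, max_iter=100_000):
    """Minimal expected number of steps to reach `target` from every state.

    P[s, a, t] is the probability of moving from s to t under action a.
    Iterates h(s) = 1 + min_a sum_t P[s, a, t] * h(t), with h(target) = 0.
    """
    h = np.zeros(P.shape[0])
    for _ in range(max_iter):
        h_new = 1.0 + (P @ h).min(axis=1)  # (S, A, S) @ (S,) -> (S, A)
        h_new[target] = 0.0
        if np.max(np.abs(h_new - h)) < tol:
            return h_new
        h = h_new
    raise RuntimeError("no convergence; the MDP may not be communicating")


def diameter(P):
    """Largest minimal expected hitting time over all ordered state pairs."""
    S = P.shape[0]
    return max(min_expected_hitting_times(P, t).max() for t in range(S))


def regret_upper_order(D, S, A, T, delta):
    """D * S * sqrt(A * T * log(T / delta)), constant factor omitted."""
    return D * S * math.sqrt(A * T * math.log(T / delta))


def regret_lower_order(D, S, A, T):
    """sqrt(D * S * A * T), constant factor omitted."""
    return math.sqrt(D * S * A * T)


# Hypothetical toy MDP with 2 states and 2 actions:
# action 0 stays put; action 1 moves to either state with probability 0.5.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[1, 0, 1] = 1.0
P[0, 1] = [0.5, 0.5]
P[1, 1] = [0.5, 0.5]

D = diameter(P)
print("diameter:", D)
print("upper-bound order:", regret_upper_order(D, 2, 2, 10_000, 0.1))
print("lower-bound order:", regret_lower_order(D, 2, 2, 10_000))
```

In this toy MDP the best way to reach the other state is to repeat action 1, which succeeds with probability 0.5 per step, so the computed diameter is approximately 2. The two bound helpers only show how the upper and lower regret bounds scale with D, S, A, and T; they say nothing about the constants.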