Online Regret Bounds for a New Reinforcement Learning Algorithm
We present a new learning algorithm for undiscounted finite-step reinforcement learning with restarts. Unlike many other analyses in reinforcement learning, we are not only interested in the convergence properties of the presented UCB algorithm, but aim at bounds on its online performance after a finite number of steps. In the spirit of similar methods already applied successfully to the exploration-exploitation trade-off in multi-armed bandit problems, we use upper confidence bounds to show that the UCB algorithm achieves logarithmic online regret with respect to an optimal policy. The algorithm also performs well in experiments.
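The upper-confidence-bound principle invoked in the abstract is most easily seen in the multi-armed bandit setting, where it originates. The following sketch is not the paper's RL algorithm; it is a minimal, illustrative implementation of the classic UCB1 rule (empirical mean plus an exploration bonus shrinking as sqrt(2 ln t / n_i)), whose logarithmic regret guarantee is the analogue of the bound claimed here. The arm means, horizon, and seed are illustrative choices.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Illustrative UCB1 on a Bernoulli multi-armed bandit.

    After pulling each arm once, always play the arm maximizing
    empirical mean + sqrt(2 ln t / n_i), i.e. an optimistic upper
    confidence bound on the arm's true mean. Returns the total
    reward collected and the per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k       # n_i: number of pulls of arm i
    sums = [0.0] * k       # cumulative reward of arm i
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1    # initialization: pull each arm once
        else:
            # optimism in the face of uncertainty: pick the arm
            # with the largest upper confidence bound
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts

total, counts = ucb1([0.9, 0.1], horizon=2000)
```

Because the confidence width shrinks as an arm is sampled, each suboptimal arm is pulled only O(log T) times over a horizon T, which is exactly the mechanism behind a logarithmic regret bound.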