Online Regret Bounds for a New Reinforcement Learning Algorithm

This is the latest version of this eprint.

Abstract

We present a new learning algorithm for undiscounted finite-step reinforcement learning with restarts. Unlike many other analyses in reinforcement learning, we are interested not only in the convergence properties of the presented UCB algorithm but also in bounds on its online performance after a finite number of steps. In the spirit of similar methods already successfully applied to the exploration-exploitation trade-off in multi-armed bandit problems, we use upper confidence bounds to show that the UCB algorithm achieves logarithmic online regret with respect to an optimal policy. The algorithm also performs well in experiments.