PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Online Regret Bounds for a New Reinforcement Learning Algorithm
Peter Auer and Ronald Ortner
(2005) Working Paper, University of Leoben.



We present a new learning algorithm for undiscounted finite-step reinforcement learning with restarts. Unlike many other analyses in reinforcement learning, we are not only interested in convergence properties of the presented UCB algorithm, but aim at bounds on its online performance after a finite number of steps. In the spirit of similar methods already applied successfully to the exploration-exploitation trade-off in multi-armed bandit problems, we use upper confidence bounds to show that the UCB algorithm achieves logarithmic online regret with respect to an optimal policy. The algorithm also performs well in experiments.
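The abstract refers to upper-confidence-bound methods originally developed for multi-armed bandits. The following is a minimal sketch of the classical UCB1 bandit rule (the precursor, not the paper's reinforcement learning algorithm): each round, play the arm maximizing its empirical mean plus an exploration bonus of sqrt(2 ln t / n_i). The function names and the Bernoulli reward model are illustrative assumptions, not from the paper.

```python
import math
import random

def ucb1(pull, n_arms, horizon, seed=0):
    """Illustrative UCB1 sketch: pull(a) returns a reward in [0, 1].

    Each round plays the arm maximizing
        empirical mean + sqrt(2 * ln t / n_a),
    which balances exploitation (first term) and exploration (bonus).
    """
    random.seed(seed)
    counts = [0] * n_arms      # number of pulls per arm
    sums = [0.0] * n_arms      # cumulative reward per arm
    # Initialize by playing each arm once.
    for a in range(n_arms):
        sums[a] += pull(a)
        counts[a] += 1
    for t in range(n_arms, horizon):
        # Upper confidence bound for each arm at round t.
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
               for a in range(n_arms)]
        a = max(range(n_arms), key=lambda i: ucb[i])
        sums[a] += pull(a)
        counts[a] += 1
    return counts

# Usage sketch: two Bernoulli arms; the higher-mean arm should dominate.
def make_bernoulli(means):
    return lambda a: 1.0 if random.random() < means[a] else 0.0

counts = ucb1(make_bernoulli([0.2, 0.8]), n_arms=2, horizon=2000)
```

Because suboptimal arms are pulled only O(log t) times under this rule, the cumulative regret grows logarithmically, which is the property the paper carries over to the reinforcement learning setting.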

EPrint Type: Other
Project Keyword: UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics; Theory & Algorithms
ID Code: 1885
Deposited By: Ronald Ortner
Deposited On: 29 December 2005
