PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Peter Auer and Ronald Ortner
In: Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006 (2006) MIT Press, pp. 49-56. ISBN 0-262-19568-2


We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
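For readers unfamiliar with the bandit methods the abstract alludes to, the following is a minimal sketch of the standard UCB1 index rule for multi-armed bandits. It is not the UCRL algorithm of the paper, which this record does not describe in detail. The function name `ucb1`, the Bernoulli reward model, the arm means and the seed are all assumptions made for illustration only.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Simulate UCB1 on Bernoulli arms (illustrative assumption, not UCRL).

    Returns the number of pulls per arm and the cumulative expected regret
    with respect to always playing the arm with the highest mean.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of times each arm was pulled
    sums = [0.0] * n_arms   # total observed reward per arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialisation: pull every arm once.
            arm = t - 1
        else:
            # Choose the arm with the largest upper confidence bound:
            # empirical mean plus an exploration bonus that shrinks
            # as the arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Expected regret of this step: gap between best and chosen arm.
        regret += best_mean - arm_means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.8], horizon=2000)
    print("pulls per arm:", counts)
    print("cumulative expected regret:", regret)
```

In this sketch the regret grows only through pulls of the suboptimal arm, and the exploration bonus is what keeps the number of such pulls small relative to the horizon. The abstract's claim concerns the analogous quantity for undiscounted reinforcement learning, measured against an optimal policy.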

EPrint Type: Book Section
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics
          Learning/Statistics & Optimisation
          Theory & Algorithms
ID Code: 3262
Deposited By: Ronald Ortner
Deposited On: 04 February 2008