PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Online Markov decision processes under bandit feedback
Gergely Neu, Andras Gyorgy, Csaba Szepesvari and András Antos
In: Twenty-Fourth Annual Conference on Neural Information Processing Systems 2010, 6-9 Dec 2010, Vancouver, B.C., Canada.

There is a more recent version of this eprint available.


We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, it does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved regret bound for the problem.
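
As a rough sketch of the quantity being bounded, the expected regret can be written out explicitly. The notation is an assumption made for illustration and does not appear on this page: r_t is the reward function chosen at step t, (x_t, a_t) are the state-action pairs visited by the learner, and (x_t^π, a_t^π) are those visited when a fixed stationary policy π is followed from the same initial state.

```latex
% Assumed notation (not from the source page):
%   r_t                : reward function chosen by the adversary at step t
%   (x_t, a_t)         : state-action pair visited by the learning agent
%   (x_t^\pi, a_t^\pi) : state-action pair visited under stationary policy \pi
\[
  \hat{L}_T
  \;=\;
  \max_{\pi}\, \mathbb{E}\Bigl[\sum_{t=1}^{T} r_t\bigl(x_t^{\pi}, a_t^{\pi}\bigr)\Bigr]
  \;-\;
  \mathbb{E}\Bigl[\sum_{t=1}^{T} r_t(x_t, a_t)\Bigr],
  \qquad
  \hat{L}_T = O\bigl(T^{2/3} (\ln T)^{1/3}\bigr).
\]
% Dividing the bound by T gives the average regret per step:
\[
  \frac{\hat{L}_T}{T} \;=\; O\Bigl(\bigl(\tfrac{\ln T}{T}\bigr)^{1/3}\Bigr)
  \;\longrightarrow\; 0 \quad \text{as } T \to \infty .
\]
```

Under this reading, the stated bound grows more slowly than T, so the average per-step reward of the algorithm approaches that of the best stationary policy.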

EPrint Type: Conference or Workshop Item (Spotlight)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Theory & Algorithms
ID Code: 7682
Deposited By: András Antos
Deposited On: 09 March 2011
