Competitive Reinforcement Learning
In: Models of Behavioural Learning Workshop (at NIPS 2005), 10 Dec 2005, Whistler, Canada.
We present a new algorithm for undiscounted reinforcement learning. In contrast to the usual convergence analysis, we bound the regret of our algorithm: we compare the total reward received by our algorithm during learning with the total reward of an optimal strategy. In fact, we do not distinguish a specific learning phase but bound the regret of our algorithm for any number of steps. We are able to show that the regret scales logarithmically with the number of steps, as it does for the much simpler bandit problem. Methodologically, we use upper confidence bounds on the expected total reward to tackle the exploration-exploitation trade-off that the online reinforcement learning algorithm faces.
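To make the upper-confidence-bound principle concrete, here is a minimal sketch in the simpler multi-armed bandit setting that the abstract references (the classical UCB1 rule, not the paper's reinforcement learning algorithm). The Bernoulli arm means and the horizon are illustrative assumptions; the point is that acting optimistically with respect to a confidence bound yields logarithmic regret.

```python
import math
import random

def ucb1(arms, horizon):
    """Pull each arm once, then always pick the arm with the highest
    upper confidence bound on its mean reward (UCB1)."""
    n = len(arms)
    counts = [0] * n          # pulls per arm
    sums = [0.0] * n          # cumulative reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            a = t - 1         # initialization: try every arm once
        else:
            # Empirical mean plus confidence radius; the radius shrinks
            # as an arm is sampled more often, so exploration tapers off.
            a = max(range(n),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = arms[a]()         # sample a reward from the chosen arm
        counts[a] += 1
        sums[a] += r
        total_reward += r
    return total_reward

# Hypothetical Bernoulli arms with means 0.3, 0.5, 0.7 (illustrative only).
arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
print(ucb1(arms, horizon=10_000))
```

The same optimism-in-the-face-of-uncertainty idea carries over to the reinforcement learning setting described above, where the confidence bounds are placed on the expected total reward of policies rather than on individual arm means.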