Competitive Reinforcement Learning

Abstract

We present a new algorithm for undiscounted reinforcement learning. In contrast to the usual convergence analysis, we bound the regret of our algorithm: we compare the total reward received by our algorithm during learning with the total reward of an optimal strategy. We do not distinguish a specific learning phase; instead, we bound the regret of our algorithm for any number of steps. We show that the regret scales logarithmically with the number of steps, as it does for the much simpler bandit problem. Methodologically, we use upper confidence bounds on the expected total reward to tackle the exploration-exploitation trade-off that the online reinforcement learning algorithm faces.
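
To illustrate the upper-confidence-bound idea in the simpler bandit setting that the abstract uses as a point of comparison, here is a minimal sketch of the standard UCB1 index rule on a Bernoulli bandit. It is not the reinforcement learning algorithm presented in this paper. The function name `ucb1`, the arm means, and the seed are assumptions chosen only for the example. Regret is measured as in the abstract: the expected reward of the optimal strategy minus the expected reward of the arms actually played.

```python
import math
import random


def ucb1(means, steps, seed=0):
    """Run UCB1 on a Bernoulli bandit; return (pull counts, expected regret).

    `means` holds the (hypothetical) true success probability of each arm.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total observed reward per arm
    best = max(means)       # expected reward of the optimal arm
    regret = 0.0

    for t in range(1, steps + 1):
        if t <= n_arms:
            # Pull each arm once so every empirical mean is defined.
            arm = t - 1
        else:
            # Optimism under uncertainty: empirical mean plus confidence width.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Expected regret: gap between the best arm and the arm played.
        regret += best - means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.5, 0.8], steps=5000)
    print("pulls per arm:", counts)
    print("expected regret:", round(regret, 1))
```

The confidence width shrinks as an arm is pulled more often, so suboptimal arms are tried only about logarithmically often in the number of steps. This is the bandit behaviour against which the abstract compares its regret bound.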