Analysis of Optimistic Algorithms for the Exploration/Exploitation Trade-Off
In: Foundations of Computational Mathematics, Hong Kong (2008).
We consider decision problems in which decisions, each associated with some gain, must be made repeatedly. When making a decision, one may rely on the information collected so far (exploitation) or try to collect further information (exploration). The risk of an exploitative decision is that a better decision, one with higher gain, goes unrecognized because of insufficient information. The risk of an explorative decision is that it is typically not optimal.
Optimistic algorithms deal with this exploration/exploitation trade-off implicitly, by assuming the most favourable gain process which is consistent with the information collected so far. Decisions are made based on this optimistic assumption.
In my talk I will show how such optimistic algorithms can be analysed, using examples from the bandit problem and the reinforcement learning problem. The generic part of these analyses is very similar in both cases; the more technical part must bound the distance between the optimistically assumed gain process and the actual gain process.
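As an illustration of the optimistic principle in the bandit setting, the following is a minimal sketch of a UCB1-style strategy. It is one well-known optimistic algorithm and is not necessarily the one analysed in the talk. The function name ucb1, the Bernoulli gain model, the arm means, and the confidence width sqrt(2 ln t / n) are assumptions made for this example. Each arm is scored by its empirical mean gain plus a confidence width, that is, by the most favourable mean still consistent with the observations, and the arm with the highest score is chosen.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run a UCB1-style optimistic strategy on simulated Bernoulli arms.

    arm_means: true success probabilities (hypothetical, unknown to the learner).
    horizon:   number of decisions to make.
    Returns the number of times each arm was chosen.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # how often each arm has been chosen
    sums = [0.0] * n_arms   # total gain observed for each arm

    def optimistic_index(arm, t):
        # Empirical mean plus a confidence width: an optimistic
        # estimate of the arm's mean gain given the data so far.
        mean = sums[arm] / counts[arm]
        width = math.sqrt(2.0 * math.log(t) / counts[arm])
        return mean + width

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # choose every arm once to initialise the estimates
        else:
            arm = max(range(n_arms), key=lambda a: optimistic_index(a, t))
        gain = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += gain

    return counts


counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(counts)
```

Arms that have been chosen rarely keep a wide confidence width and are therefore tried again from time to time, so exploration happens implicitly, without a separate exploration rule.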