PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Adaptive bandits: Towards the best history-dependent strategy
Odalric-Ambrym Maillard and Rémi Munos
JMLR Workshop and Conference Proceedings, Volume 15: AISTATS 2011, pp. 570-578, 2011.


We consider multi-armed bandit games with possibly adaptive opponents. We introduce models \Theta of constraints based on equivalence classes on the common history (information shared by the player and the opponent), which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model \theta^*\in\Theta. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in \Theta. This allows modeling opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits and other situations. When \Theta=\{\theta\}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by \tilde O(\sqrt{TAC}), where A is the number of actions and C is the number of classes of \theta. When many models are available, all known algorithms achieving an O(\sqrt{T}) regret are unfortunately not tractable and scale poorly with the number of models |\Theta|. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3}C^{1/3}\log(|\Theta|)^{1/2}.
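
As an illustration of the single-model setting (\Theta=\{\theta\}), the sketch below runs one independent Exp3 learner per equivalence class of the history. This is only one natural way to obtain a bound of the shape \tilde O(\sqrt{TAC}): summing the per-class Exp3 bounds and applying Cauchy-Schwarz gives that rate. The abstract does not spell out the algorithm, so the construction, the class names (Exp3, PerClassExp3), the class function last_action_class (a memory-1 model), the toy opponent and the choice of gamma are all assumptions made for this example, not the authors' implementation.

```python
import math
import random


class Exp3:
    """Standard Exp3 for rewards in [0, 1] (illustrative implementation)."""

    def __init__(self, n_arms, gamma, rng):
        self.n_arms = n_arms
        self.gamma = gamma
        self.rng = rng
        self.weights = [1.0] * n_arms

    def probabilities(self):
        # Mix the exponential-weights distribution with uniform exploration.
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted reward estimate for the arm that was played.
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale to avoid overflow; the probabilities are unchanged.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]


class PerClassExp3:
    """One Exp3 learner per history class (hypothetical helper)."""

    def __init__(self, n_arms, gamma, class_of, rng):
        self.n_arms, self.gamma = n_arms, gamma
        self.class_of, self.rng = class_of, rng
        self.learners = {}

    def act(self, history):
        c = self.class_of(history)
        if c not in self.learners:
            self.learners[c] = Exp3(self.n_arms, self.gamma, self.rng)
        arm, prob = self.learners[c].select()
        return c, arm, prob

    def learn(self, c, arm, prob, reward):
        self.learners[c].update(arm, prob, reward)


def last_action_class(history):
    # Assumed model: the class of a history is the last action played
    # (None for the empty history), i.e. a memory-1 model.
    return history[-1][0] if history else None


def run_switching_game(horizon=5000, n_arms=2, seed=0):
    """Toy adaptive opponent: reward 1 iff the player changes arm."""
    rng = random.Random(seed)
    n_classes = n_arms  # one class per possible last action
    # Heuristic tuning, assuming roughly horizon / n_classes rounds per class.
    gamma = min(1.0, math.sqrt(n_arms * math.log(n_arms) * n_classes
                               / ((math.e - 1.0) * horizon)))
    player = PerClassExp3(n_arms, gamma, last_action_class, rng)
    history, total = [], 0.0
    for _ in range(horizon):
        c, arm, prob = player.act(history)
        reward = 1.0 if arm != c else 0.0
        player.learn(c, arm, prob, reward)
        history.append((arm, reward))
        total += reward
    return total / horizon


if __name__ == "__main__":
    print("average reward:", run_switching_game())
```

Against this toy opponent no fixed arm earns more than about 1/T on average, whereas the best class-based strategy (always switch) earns 1 per round, so the example shows why regret against history-class-based strategies is the more demanding benchmark.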

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Theory & Algorithms
ID Code: 8761
Deposited By: Odalric-Ambrym Maillard
Deposited On: 21 February 2012