# Towards the best history-dependent strategy

## Abstract

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Θ of constraints based on equivalence classes on the common history (the information shared by the player and the opponent), which define two learning scenarios: (1) the opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model θ∗ ∈ Θ; the regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Θ. This allows us to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits, and other situations. When Θ = {θ}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by O(sqrt(TAC)), where C is the number of classes of θ. When many models are available, all known algorithms achieving a nice regret of O(sqrt(T)) are unfortunately not tractable and scale poorly with the number of models |Θ|. Our contribution here is to provide tractable algorithms with regret bounded by O(T^{2/3} C^{1/3} log(|Θ|)^{1/2}).
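As an informal illustration (not taken from the paper), the single-model setting Θ = {θ} can be sketched as a bandit in which the reward distribution depends only on the equivalence class of the current history, and the learner keeps one index per (class, action) pair. The sketch below uses a hypothetical toy model where the history class is simply the previous action, and a standard UCB1 index within each class; all names (`run_class_ucb`, `mu`) and the choice of model are illustrative assumptions, not the paper's algorithm.

```python
import math
import random

def run_class_ucb(mu, T, seed=0):
    """Toy single-model scenario: rewards are Bernoulli with mean
    mu[class][action], where the history class is (by assumption in this
    sketch) the previous action. A UCB1 index is kept per (class, action)."""
    rng = random.Random(seed)
    C = len(mu)          # number of history classes
    A = len(mu[0])       # number of actions
    counts = [[0] * A for _ in range(C)]
    sums = [[0.0] * A for _ in range(C)]
    cls = 0              # initial history class
    total_reward = 0.0
    for t in range(1, T + 1):
        def index(a):
            n = counts[cls][a]
            if n == 0:
                return float("inf")  # try each arm in the class once
            return sums[cls][a] / n + math.sqrt(2 * math.log(t) / n)
        action = max(range(A), key=index)
        reward = 1.0 if rng.random() < mu[cls][action] else 0.0
        counts[cls][action] += 1
        sums[cls][action] += reward
        total_reward += reward
        cls = action     # toy dynamics: next class is the action just played
    return total_reward, counts

# Example: 2 classes, 2 actions; the best class-based mapping plays
# action 0 in class 0 and action 1 in class 1.
reward, counts = run_class_ucb([[0.9, 0.1], [0.2, 0.8]], T=2000)
```

The per-class UCB1 indices are what makes the regret scale with the number of classes C as well as the number of actions A, matching the flavor of the O(sqrt(TAC)) bound stated above.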