Minimax policies for adversarial and stochastic bandits
Jean-Yves Audibert and Sébastien Bubeck
We fill in a long open gap in the characterization of
the minimax rate for the multi-armed bandit problem.
Concretely, we remove an extraneous logarithmic
factor in the previously known upper bound
and propose a new family of randomized algorithms
based on an implicit normalization, as well as a
new analysis. We also consider the stochastic case
and prove that an appropriate modification of the
upper confidence bound policy UCB1 (Auer et al.,
2002) achieves the distribution-free optimal rate
while still having a distribution-dependent rate logarithmic
in the number of plays.