UCB Revisited: Improved Regret Bounds for the Stochastic Multi-Armed Bandit Problem
In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in K-armed bandits after T trials is bounded by const · K log(T)/Delta, where Delta measures the gap between the mean reward of a suboptimal arm and that of the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of const · K log(T Delta^2)/Delta. Since Delta <= 1, the factor log(T Delta^2) is smaller than log(T), so the new bound improves on the old one precisely when the gap Delta is small.
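The abstract does not spell out the modified algorithm, but for context, the original UCB1 index policy of Auer et al. that it builds on can be sketched as follows. This is a minimal illustration, not the modified algorithm of this paper; the Bernoulli reward model, the function names, and the parameter values are assumptions chosen for the example.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Sketch of the original UCB1 of Auer et al.: pull each arm once,
    then always pull the arm maximizing
        empirical mean + sqrt(2 * ln(t) / n_i),
    where n_i is the number of pulls of arm i so far.
    Rewards here are simulated as Bernoulli with the given means
    (an assumption for this illustration)."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # n_i: pulls of each arm
    sums = [0.0] * k      # accumulated reward of each arm

    def pull(i):
        r = 1.0 if rng.random() < arm_means[i] else 0.0
        counts[i] += 1
        sums[i] += r

    for i in range(k):                    # initialization: one pull per arm
        pull(i)
    for t in range(k + 1, horizon + 1):   # remaining rounds
        index = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                 for i in range(k)]
        pull(max(range(k), key=lambda i: index[i]))
    return counts

counts = ucb1([0.9, 0.5, 0.4], horizon=2000)
```

After enough rounds the optimal arm dominates the pull counts, while each suboptimal arm is pulled only on the order of log(T)/Delta^2 times, which is where the log(T)/Delta term in the regret bound comes from.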