## Abstract

In this paper we consider an extension of the multi-armed bandit problem. In this generalized setting, the decision maker receives some side information, performs an action chosen from a finite set, and then receives a reward. Unlike in the standard bandit setting, performing an action takes a random period of time. The environment is assumed to be stationary, stochastic and memoryless. The goal is to maximize the average reward received per unit time, that is, to maximize the average rate of return. We consider the on-line learning problem where the decision maker initially knows nothing about the environment and must learn about it by trial and error. We propose an “upper confidence bound”-style algorithm that exploits the structure of the problem. We show that the regret of this algorithm, relative to the optimal algorithm that has perfect knowledge of the problem, grows with the number of decisions n at a rate Õ(log n), that is, only slightly faster than the optimal logarithmic rate, and scales polynomially with the parameters of the problem.
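To make the objective concrete, the following is a minimal sketch of an optimistic rule for maximizing reward per unit time. It is not the algorithm proposed in the paper: it ignores side information, and the function `ratio_ucb`, the confidence bonus, the lower bound `tau_min` on durations, and the two example arms are all assumptions chosen for illustration. Each action is scored by an optimistic reward estimate divided by a pessimistic duration estimate, so an action with a high reward but a long duration can lose to a faster one.

```python
import math
import random


def ratio_ucb(arms, n_decisions, tau_min, seed=0):
    """Illustrative optimistic rule for reward per unit time.

    arms: list of callables taking an rng and returning (reward, duration).
    tau_min: assumed known lower bound on every duration (hypothetical).
    """
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k
    reward_sum = [0.0] * k
    time_sum = [0.0] * k

    for t in range(1, n_decisions + 1):
        if t <= k:
            a = t - 1  # play every action once to initialize estimates
        else:
            def index(i):
                # Hypothetical confidence width, shared by both estimates.
                bonus = math.sqrt(2.0 * math.log(t) / counts[i])
                reward_hi = reward_sum[i] / counts[i] + bonus
                duration_lo = max(time_sum[i] / counts[i] - bonus, tau_min)
                return reward_hi / duration_lo

            a = max(range(k), key=index)

        reward, duration = arms[a](rng)
        counts[a] += 1
        reward_sum[a] += reward
        time_sum[a] += duration

    # Empirical rate of return: total reward divided by total elapsed time.
    return counts, sum(reward_sum) / sum(time_sum)


def fast_arm(rng):
    # Mean reward 0.8, mean duration 1.0, so the rate is about 0.8.
    return (1.0 if rng.random() < 0.8 else 0.0), rng.uniform(0.5, 1.5)


def slow_arm(rng):
    # Mean reward 0.9, mean duration 3.0, so the rate is about 0.3.
    return (1.0 if rng.random() < 0.9 else 0.0), rng.uniform(2.5, 3.5)


counts, rate = ratio_ucb([fast_arm, slow_arm], n_decisions=5000, tau_min=0.5)
print(counts, rate)
```

In this toy example the slow action pays more per decision but less per unit time, so a rule that maximizes the rate of return should come to prefer the fast action.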