## Abstract

We consider the online decision problem where an agent interacts with an unpredictable and possibly adversarial environment. The goal of the agent is to maximize its long-term average reward subject to long-term average cost constraints. As is well known, without constraints, there exist a number of online algorithms that have the no-regret property, in the sense that they guarantee a long-term average reward as high as could be achieved by any fixed action of the agent, given the observed sequence of the environment's actions. We refer to the latter as the best-response envelope. In the constrained setting, we propose a relaxed form of the best-response envelope as the reference level for a no-regret algorithm. This relaxed best-response envelope incorporates a vector of relaxation parameters; we characterize the minimal value of this parameter vector that ensures the relaxed best-response envelope is attainable while satisfying the long-term cost constraints. A computationally feasible algorithm, Constrained Regret Matching (CRM), is proposed and analyzed. In addition, an adaptive variant of the CRM algorithm is introduced, which tunes the relaxation parameters according to the observed actions of the environment.
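To make the unconstrained reference level concrete, the following is a minimal sketch of how the best-response envelope and the resulting average regret could be computed in hindsight for a finite game. It is an illustration only, not the CRM algorithm: the reward table `reward`, the action sequences, and the function names are hypothetical and do not appear in the paper, and the cost constraints and relaxation parameters are deliberately left out.

```python
def best_response_envelope(reward, env_actions):
    """Average reward of the best fixed agent action in hindsight.

    reward[a][b] is an assumed reward when the agent plays action a and
    the environment plays action b; env_actions is the observed sequence
    of environment action indices.
    """
    n = len(env_actions)
    num_env = len(reward[0])
    # Empirical frequency of each environment action.
    q = [env_actions.count(b) / n for b in range(num_env)]
    # Expected reward of each fixed agent action against q; keep the best.
    return max(sum(r_ab * q_b for r_ab, q_b in zip(row, q)) for row in reward)


def average_regret(reward, agent_actions, env_actions):
    """Gap between the best-response envelope and the achieved average reward."""
    n = len(env_actions)
    achieved = sum(reward[a][b] for a, b in zip(agent_actions, env_actions)) / n
    return best_response_envelope(reward, env_actions) - achieved


# Toy matching game: the agent is rewarded for matching the environment.
reward = [[1.0, 0.0],
          [0.0, 1.0]]
env = [0, 0, 1, 0]
agent = [1, 1, 1, 0]

print(best_response_envelope(reward, env))
print(average_regret(reward, agent, env))
```

An algorithm has the no-regret property in this sense when the quantity returned by `average_regret` becomes non-positive in the limit as the sequence grows, whatever the environment does.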