Reinforcement Learning by Advantage Weighted Regression
Recently, batch-mode reinforcement learning (BMRL) methods have become more popular due to their higher learning speed, more stable learning processes, and the higher quality of the resulting policies. However, these methods remain hard to use for continuous action spaces, which frequently occur in real-world control tasks, e.g., in robotics and in plant control. The greedy action selection commonly used in BMRL is particularly problematic: it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias, and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we offer an alternative approach to reinforcement learning in which we aim to find good smooth approximations of the optimal policy by reducing the standard reinforcement learning problem to an iterative advantage-weighted regression problem. The resulting algorithm naturally produces smooth continuous policies and outperforms current state-of-the-art methods.
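To make the idea of advantage-weighted regression concrete, the following is a minimal sketch of a single regression step, not the algorithm as specified in the paper. It assumes that advantage estimates for the sampled state-action pairs are already available, that each sample is weighted by exp(advantage / beta) with a hypothetical temperature parameter `beta`, and that the policy mean is linear in the state; the function name and the small `ridge` term are likewise illustrative choices.

```python
import numpy as np

def advantage_weighted_regression(states, actions, advantages, beta=1.0, ridge=1e-6):
    """Fit a linear policy mean a = [s, 1] @ W by weighted least squares.

    Assumption (not taken from the abstract): each sample is weighted by
    exp(advantage / beta), so high-advantage actions dominate the fit.
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Append a constant feature so the policy can learn an offset.
    X = np.hstack([states, np.ones((states.shape[0], 1))])

    # Subtract the maximum before exponentiating to avoid overflow.
    # This rescales all weights by one common factor.
    w = np.exp((advantages - advantages.max()) / beta)

    # Weighted normal equations: (X^T W X + ridge * I) W_policy = X^T W a.
    XtW = X.T * w  # scales each sample (column of X.T) by its weight
    lhs = XtW @ X + ridge * np.eye(X.shape[1])
    rhs = XtW @ actions
    return np.linalg.solve(lhs, rhs)
```

Because the policy is obtained by regression rather than by a greedy maximization over actions, it is smooth in the state by construction. In an iterative scheme, a step like this would alternate with re-estimating the advantages under the updated policy; smaller values of `beta` make the fit concentrate more strongly on the highest-advantage samples.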