Asymptotic Performance Guarantee for Online Reinforcement Learning with Least-Squares Regression
We introduce a new online reinforcement learning algorithm, least-squares action-preference learning (LS-APL), for stochastic infinite-horizon Markov decision processes. We provide a non-trivial asymptotic performance loss bound that holds with probability 1 for LS-APL. This result requires that the stochastic Markov process induced by the learning policy satisfy a certain mixing assumption. The bound differs from existing ones mainly in that it applies to problems with a limited sampling budget per iteration. To illustrate the applicability of LS-APL, we assess its performance on the optimal replacement problem.