Least Squares Temporal Difference Methods: An Analysis Under General Conditions
We consider approximate policy evaluation for finite-state and finite-action Markov decision processes (MDPs) with the least squares temporal difference algorithm, LSTD(lambda), in an exploration-enhanced off-policy learning context. We establish for the discounted cost criterion that off-policy LSTD(lambda) converges almost surely under mild, minimal conditions. We also analyze other convergence and boundedness properties of the iterates involved in the algorithm. Our analysis draws on theories of both finite-space Markov chains and weak Feller Markov chains on topological spaces. Our results can be applied to other temporal difference algorithms and MDP models. As examples, we give a convergence analysis of an off-policy TD(lambda) algorithm and extensions to MDPs with compact state and action spaces.
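To fix ideas, the following is a minimal on-policy sketch of the LSTD(lambda) estimator for policy evaluation with linear function approximation (the off-policy variant analyzed in the paper additionally weights the updates by importance-sampling ratios, which this sketch omits). All names and the toy two-state chain below are illustrative assumptions, not from the paper: the estimator accumulates A = sum_t z_t (phi_t - gamma phi_{t+1})^T and b = sum_t z_t r_t with eligibility trace z_t = gamma*lambda*z_{t-1} + phi_t, then solves A theta = b.

```python
import numpy as np

def lstd_lambda(features, rewards, gamma=0.9, lam=0.5):
    """One-pass LSTD(lambda) estimate from a single trajectory.

    features: (T+1, d) array of feature vectors phi(s_0), ..., phi(s_T)
    rewards:  (T,) array of one-stage rewards r_0, ..., r_{T-1}
    Returns theta solving A theta = b in the least-squares sense.
    """
    d = features.shape[1]
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)  # eligibility trace
    for t in range(len(rewards)):
        z = gamma * lam * z + features[t]                      # trace update
        A += np.outer(z, features[t] - gamma * features[t + 1])
        b += z * rewards[t]
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Illustrative usage: a two-state chain with uniform random transitions,
# reward 1 in state 0 and 0 in state 1, one-hot (tabular) features.
rng = np.random.default_rng(0)
T = 5000
states = np.zeros(T + 1, dtype=int)
for t in range(T):
    states[t + 1] = rng.integers(2)
phi = np.eye(2)[states]
r = np.where(states[:-1] == 0, 1.0, 0.0)
theta = lstd_lambda(phi, r, gamma=0.9, lam=0.5)
```

With tabular features the estimate `theta` approximates the true value function of the chain, so the component for state 0 (which earns the reward) exceeds that for state 1.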