PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Classification-based Policy Iteration with a Critic
Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh and Bruno Scherer
In: Twenty-Eighth International Conference on Machine Learning (ICML-2011), Seattle, Washington, USA(2011).


In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This al- lows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout esti- mates, and as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on the LSTD method. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two bench- mark reinforcement learning problems.

EPrint Type:Conference or Workshop Item (Paper)
Project Keyword:Project Keyword UNSPECIFIED
Subjects:Learning/Statistics & Optimisation
Theory & Algorithms
ID Code:9074
Deposited By:Mohammad Ghavamzadeh
Deposited On:21 February 2012