Bayesian Actor Critic: A Bayesian Model for Value Function Approximation and Policy Learning
In this paper, we present a Bayesian take on the actor-critic architecture. The proposed Bayesian actor-critic (BAC) model uses a class of non-parametric Bayesian critics based on Gaussian process temporal-difference (GPTD) learning. Such critics model the action-value function as a Gaussian process, allowing Bayes' rule to be used to compute a posterior distribution over action-value functions, conditioned on the observed data. The actor in BAC uses the posterior distribution over action-value functions computed by the critic to derive a posterior distribution over the gradient of the average discounted return with respect to the policy parameters. Appropriate choices of the prior covariance (kernel) between state-action values, which make the action-value function compatible with the parametric family of policies, allow us to obtain closed-form expressions for the posterior distribution of the policy gradient. The posterior mean serves as our estimate of the gradient and is used to update the policy, while the posterior covariance allows us to gauge the reliability of the update.
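To make the idea concrete, here is a minimal toy sketch (not the paper's actual derivation) of how a Gaussian posterior over a linear critic induces a Gaussian posterior over the policy gradient. It assumes a hypothetical two-action softmax policy and replaces the GP critic with a Bayesian linear-Gaussian critic over the compatible score features u(s, a) = ∇θ log π(a|s); all names and numerical settings below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy policy: softmax over two actions with parameter theta;
# u(s, a) = grad_theta log pi(a|s) are the "compatible" features, a simple
# stand-in for the compatible kernel choice discussed in the abstract.
def score(theta, s, a):
    phi = np.array([[s, 1.0], [-s, -1.0]])  # per-action feature rows
    logits = phi @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return phi[a] - p @ phi                 # grad_theta log pi(a|s)

theta = np.array([0.5, -0.2])
T, sigma2 = 50, 0.1                          # sample size, noise variance
S = rng.uniform(-1.0, 1.0, T)                # sampled states (toy)
A = rng.integers(0, 2, T)                    # sampled actions (toy)
U = np.stack([score(theta, s, a) for s, a in zip(S, A)])   # (T, d)
# Synthetic noisy action-value observations (illustrative only):
q_obs = U @ np.array([1.0, 0.3]) + rng.normal(0.0, np.sqrt(sigma2), T)

# Bayesian linear-Gaussian critic: Q(s, a) ~ w . u(s, a), prior w ~ N(0, I).
# Conditioning on the data gives a Gaussian posterior over w.
post_cov_w = np.linalg.inv(np.eye(2) + U.T @ U / sigma2)
post_mean_w = post_cov_w @ U.T @ q_obs / sigma2

# Pushing that posterior through the sample policy-gradient estimator
# g = (1/T) sum_t u_t (w . u_t) = M w, with M = (1/T) sum_t u_t u_t^T,
# yields a Gaussian posterior over the gradient itself.
M = U.T @ U / T
grad_mean = M @ post_mean_w          # posterior mean: the gradient estimate
grad_cov = M @ post_cov_w @ M.T      # posterior covariance: its reliability

theta_new = theta + 0.1 * grad_mean  # actor step along the posterior mean
```

Because the gradient posterior is an affine map of a Gaussian, its mean and covariance come out in closed form; the covariance could, for example, drive an adaptive step size, shrinking updates when the gradient estimate is uncertain.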