Optimistic Planning for Deterministic Systems
Jean-François Hren and Rémi Munos
In: European Workshop on Reinforcement Learning, Lille (2008).
Abstract. If one possesses a model of a controlled deterministic system,
then from any state, one may consider the set of all possible reachable
states starting from that state and using any sequence of actions. This
forms a tree whose size is exponential in the planning time horizon.
Here we ask the question: given finite computational resources (e.g. CPU
time), which may not be known ahead of time, what is the best way
to explore this tree, such that once all resources have been used, the
algorithm would be able to propose an action (or a sequence of actions)
whose performance is as close as possible to optimality? The performance
with respect to optimality is assessed in terms of the regret (with respect
to the sum of discounted future rewards) resulting from choosing the
action returned by the algorithm instead of an optimal action. In this
paper we investigate an optimistic exploration of the tree, where the most
promising states are explored first, and compare this approach to a naive
uniform exploration. Bounds on the regret are derived both for uniform
and optimistic exploration strategies. Numerical simulations illustrate
the benefit of optimistic planning.
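The optimistic exploration described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: rewards are assumed to lie in [0, 1], the deterministic model is a user-supplied function `step(state, action) -> (next_state, reward)` (a hypothetical interface), and each leaf at depth d with accumulated discounted reward u is scored by the optimistic upper bound b = u + γ^d / (1 − γ). The leaf with the largest b-value is expanded first; a uniform strategy would instead expand leaves in breadth-first order.

```python
import heapq

def optimistic_plan(root, step, actions, gamma=0.9, budget=100):
    """Expand `budget` nodes of the lookahead tree, always choosing the
    leaf with the largest optimistic b-value, then return the first
    action of the best discounted-reward path found so far."""
    counter = 0  # unique tie-breaker so heapq never compares states
    # Heap entries: (-b_value, counter, state, depth, disc_reward, first_action)
    heap = [(-1.0 / (1.0 - gamma), counter, root, 0, 0.0, None)]
    best_u, best_action = -1.0, None
    for _ in range(budget):
        if not heap:
            break
        _, _, state, depth, u, first = heapq.heappop(heap)
        for a in actions:
            nxt, r = step(state, a)          # deterministic model
            u2 = u + (gamma ** depth) * r    # discounted reward along path
            fa = a if first is None else first
            if u2 > best_u:                  # track best path seen so far
                best_u, best_action = u2, fa
            counter += 1
            # Optimistic bound: all future rewards assumed maximal (= 1)
            b2 = u2 + gamma ** (depth + 1) / (1.0 - gamma)
            heapq.heappush(heap, (-b2, counter, nxt, depth + 1, u2, fa))
    return best_action

# Usage on a toy chain: states are integers, actions move left or right,
# and reward 1 is earned whenever the next state is 5.
def step(s, a):
    nxt = s + a
    return nxt, (1.0 if nxt == 5 else 0.0)

# Starting from state 3, the planner heads toward the rewarding state.
print(optimistic_plan(3, step, (1, -1), gamma=0.9, budget=200))  # → 1
```

The regret analysis in the paper hinges on exactly this ordering: because the b-values are true upper bounds on the optimal discounted return through each leaf, expanding the highest b-value first concentrates the budget on near-optimal branches.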