## AbstractIn this paper we propose a novel gradient al- gorithm to learn a policy from an expert's observed behavior assuming that the expert behaves optimally with respect to some un- known reward function of a Markovian De- cision Problem. The algorithm's aim is to find a reward function such that the resulting optimal policy matches well the expert's ob- served behavior. The main difficulty is that the mapping from the parameters to poli- cies is both nonsmooth and highly redun- dant. Resorting to subdifferentials solves the first difficulty, while the second one is over- come by computing natural gradients. We tested the proposed method in two artificial domains and found it to be more reliable and e±cient than some previous methods.
[Edit] |