An Expectation Maximization Algorithm for Continuous Markov Decision Processes with Arbitrary Reward

We derive a new expectation maximization algorithm for policy optimization in linear Gaussian Markov decision processes, where the reward function is parameterized as a flexible mixture of Gaussians. This approach exploits both analytical tractability and numerical optimization. On the one hand, it is more flexible and general than closed-form solutions, such as the widely used linear quadratic Gaussian (LQG) controllers. On the other hand, it is more accurate and faster than optimization methods that rely on approximation and simulation. Partial analytical solutions, though costly, eliminate the need for simulation and hence avoid approximation error. Our experiments show that, for the same computational cost, policy optimization methods that exploit analytical tractability achieve higher value than those that rely on simulation.
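As a point of reference, the model class described above can be written as follows. This is a minimal sketch under assumptions consistent with the abstract; the particular symbols (dynamics matrices A, B, policy parameters K, m, and mixture weights w_j) are illustrative notation rather than the paper's own.

% Hypothetical notation for the setting sketched in the abstract:
% linear Gaussian dynamics, a linear Gaussian policy, and a reward
% expressed as a mixture of Gaussians over the state-action pair.
\begin{align*}
  x_{t+1} &\sim \mathcal{N}\!\big(A x_t + B u_t,\; \Sigma_x\big)
      && \text{(linear Gaussian dynamics)} \\
  u_t &\sim \mathcal{N}\!\big(K x_t + m,\; \Sigma_u\big)
      && \text{(linear Gaussian policy, parameters } \theta = \{K, m\}\text{)} \\
  r(x_t, u_t) &= \sum_{j=1}^{J} w_j \,
      \mathcal{N}\!\big([x_t; u_t] \mid \mu_j, \Sigma_j\big)
      && \text{(mixture-of-Gaussians reward)}
\end{align*}
% Viewing the reward as an unnormalized likelihood, an EM scheme of this
% kind alternates an E-step whose expectations remain analytically
% tractable (each Gaussian mixture component is conjugate to the linear
% Gaussian dynamics) and an M-step that updates the policy parameters
% theta, either in closed form or by numerical optimization.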
