New inference strategies for solving Markov Decision Processes using reversible jump MCMC

In this paper we build on previous work that uses inference techniques, in particular Markov chain Monte Carlo (MCMC) methods, to solve parameterized control problems. We propose a number of modifications that make this approach more practical in general, higher-dimensional spaces. We first introduce a new target distribution that incorporates more reward information from sampled trajectories. We also show how to break strong correlations between the policy parameters and sampled trajectories in order to sample more freely. Finally, we show how to combine these techniques in a principled manner to obtain estimates of the optimal policy.
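To make the "control as inference" setting concrete, the sketch below samples policy parameters and trajectories jointly from a distribution proportional to reward times trajectory likelihood, the kind of target this line of work builds on. It is a plain Metropolis-Hastings sampler, not the paper's reversible jump scheme or its modified target, and the toy linear dynamics, quadratic log-reward, prior, and proposal scale are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch: Metropolis-Hastings over (theta, trajectory) targeting
#   p(theta, tau) proportional to R(tau) * p(tau | theta) * p(theta),
# where theta parameterises a linear feedback policy u_t = theta * x_t.
# All modelling choices here are illustrative, not taken from the paper.

rng = np.random.default_rng(0)
T = 20          # horizon
SIGMA = 0.1     # process noise standard deviation

def simulate(theta):
    """Roll out a trajectory under u_t = theta * x_t; return the state sequence."""
    x = np.empty(T + 1)
    x[0] = 1.0
    for t in range(T):
        u = theta * x[t]
        x[t + 1] = x[t] + u + SIGMA * rng.standard_normal()
    return x

def log_reward(traj):
    """Log of a positive reward R(tau) = exp(-sum_t x_t^2); high near the origin."""
    return -np.sum(traj ** 2)

def log_prior(theta):
    """Standard-normal prior over the policy parameter (an illustrative choice)."""
    return -0.5 * theta ** 2

def mh_policy_search(n_iters=5000, prop_std=0.1):
    """Sample (theta, tau) pairs from p(theta, tau) ~ R(tau) p(tau|theta) p(theta)."""
    theta = 0.0
    traj = simulate(theta)
    log_w = log_reward(traj) + log_prior(theta)
    samples = []
    for _ in range(n_iters):
        theta_new = theta + prop_std * rng.standard_normal()
        traj_new = simulate(theta_new)  # trajectory proposed from the model p(tau|theta)
        log_w_new = log_reward(traj_new) + log_prior(theta_new)
        # Because trajectories are proposed from p(tau|theta), that factor cancels
        # in the acceptance ratio, leaving only the reward and prior terms.
        if np.log(rng.uniform()) < log_w_new - log_w:
            theta, traj, log_w = theta_new, traj_new, log_w_new
        samples.append(theta)
    return np.array(samples)

if __name__ == "__main__":
    thetas = mh_policy_search()
    print("posterior mean of theta (second half):", thetas[len(thetas) // 2:].mean())
```

Because the reward enters the target only through a single rollout per proposal, samples of theta concentrate on policies with high expected reward; the paper's contributions address how to use reward information more fully and how to decouple theta from the sampled trajectories.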
