Trans-dimensional MCMC for Bayesian policy learning

A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with a single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.
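To make the idea concrete, the sketch below is a minimal toy reward-weighted trans-dimensional sampler, not the paper's actual algorithm: a scalar policy parameter theta drives a one-dimensional random walk, the horizon k is treated as a random variable with a geometric prior, a reversible-jump-style move grows or shrinks the horizon, and a random-walk move updates theta. The dynamics, reward, priors, and constants (TARGET, NOISE_STD, P_CONT) are all illustrative assumptions; trajectories are resimulated from the model in every proposal so their densities cancel in the acceptance ratio.

import numpy as np

rng = np.random.default_rng(0)
TARGET, NOISE_STD, P_CONT = 5.0, 0.5, 0.95   # illustrative problem constants

def reward(x):
    # bounded reward in (0, 1], peaked at the target state
    return np.exp(-0.5 * (x - TARGET) ** 2)

def simulate(theta, k):
    # roll out k steps of the toy dynamics x_{t+1} = x_t + theta + noise
    x, xs = 0.0, []
    for _ in range(k):
        x = x + theta + NOISE_STD * rng.normal()
        xs.append(x)
    return np.array(xs)

def log_target(theta, k, xs):
    # log of r(x_k) * p(k) * p(theta); trajectory terms are omitted because
    # every proposal below resimulates the trajectory from the model,
    # so those densities cancel in the acceptance ratio
    log_prior_k = k * np.log(P_CONT) + np.log(1.0 - P_CONT)   # geometric horizon prior
    log_prior_theta = -0.5 * theta ** 2                       # N(0, 1) prior on theta
    return np.log(reward(xs[-1])) + log_prior_k + log_prior_theta

def reward_weighted_sampler(n_iters=5000):
    theta, k = 0.5, 5
    xs = simulate(theta, k)
    lp = log_target(theta, k, xs)
    samples = []
    for _ in range(n_iters):
        if rng.random() < 0.5:
            # trans-dimensional move: grow or shrink the horizon by one step
            k_new, theta_new = k + rng.choice([-1, 1]), theta
            if k_new < 1:                    # proposal falls outside the support
                samples.append((theta, k))
                continue
        else:
            # fixed-dimension move: random-walk update of the policy parameter
            k_new, theta_new = k, theta + 0.2 * rng.normal()
        xs_new = simulate(theta_new, k_new)
        lp_new = log_target(theta_new, k_new, xs_new)
        if np.log(rng.random()) < lp_new - lp:   # Metropolis-Hastings accept
            theta, k, xs, lp = theta_new, k_new, xs_new, lp_new
        samples.append((theta, k))
    return np.array(samples)

samples = reward_weighted_sampler()
print("posterior mean of theta after burn-in:", samples[2000:, 0].mean())
print("posterior mean horizon k:", samples[2000:, 1].mean())

Under these toy assumptions the chain favours (theta, k) pairs whose rollouts end near the target, which is the sense in which sampling from a reward-proportional distribution performs policy search without gradients.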
