Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning that exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems, because it avoids expensive applications of Bayes' rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by demonstrating it in an infinite state space domain that is qualitatively out of reach of almost all previous work in Bayesian exploration.
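To make the core idea concrete, below is a minimal sketch of root-sampled Monte-Carlo tree search for a Bayes-adaptive MDP. It is not the paper's implementation: the names (DirichletBelief, bamcp_plan, the reward callable) are illustrative, the belief is assumed to be an independent Dirichlet-multinomial posterior per (state, action) pair of a small discrete MDP, and refinements such as a learned rollout policy are omitted. Each simulation uses a fresh model dictionary, so transition rows are sampled from the posterior only when first needed and never re-inferred inside the tree.

```python
import random, math
from collections import defaultdict

class DirichletBelief:
    """Posterior over transitions of a small discrete MDP:
    one Dirichlet (prior count 1) per (state, action) pair."""
    def __init__(self, n_states, n_actions):
        self.counts = defaultdict(lambda: [1.0] * n_states)
        self.n_states, self.n_actions = n_states, n_actions

    def update(self, s, a, s2):
        self.counts[(s, a)][s2] += 1.0

    def sample_row(self, s, a):
        # One draw from the Dirichlet posterior via normalised Gammas.
        g = [random.gammavariate(c, 1.0) for c in self.counts[(s, a)]]
        z = sum(g)
        return [x / z for x in g]

def bamcp_plan(belief, reward, s0, n_actions, n_sims=2000,
               depth=15, gamma=0.95, c_uct=2.0):
    """Root-sampling MCTS sketch: statistics are kept per history,
    and each simulation lazily samples one model from the belief."""
    N = defaultdict(int)      # visits per (history, action)
    Nh = defaultdict(int)     # visits per history
    Q = defaultdict(float)    # mean return per (history, action)

    def simulate(s, hist, model, d):
        if d == depth:
            return 0.0
        def ucb(a):                      # UCB1 action selection
            if N[(hist, a)] == 0:
                return float('inf')
            return Q[(hist, a)] + c_uct * math.sqrt(
                math.log(Nh[hist]) / N[(hist, a)])
        a = max(range(n_actions), key=ucb)
        # Sample this transition row only on first use in this simulation.
        if (s, a) not in model:
            model[(s, a)] = belief.sample_row(s, a)
        s2 = random.choices(range(belief.n_states), model[(s, a)])[0]
        ret = reward(s, a, s2) + gamma * simulate(
            s2, hist + ((a, s2),), model, d + 1)
        Nh[hist] += 1
        N[(hist, a)] += 1
        Q[(hist, a)] += (ret - Q[(hist, a)]) / N[(hist, a)]
        return ret

    for _ in range(n_sims):
        simulate(s0, (), {}, 0)   # fresh dict = one root-sampled model
    return max(range(n_actions), key=lambda a: Q[((), a)])
```

The design point the abstract stresses is visible here: the posterior is consulted only at sampling time, so no belief update (no application of Bayes' rule) ever happens inside the search tree; averaging returns across simulations, each driven by a different sampled model, is what approximates Bayes-optimal planning.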
