Better Optimism By Bayes: Adaptive Planning with Rich Models

The computational costs of inference and planning have confined Bayesian model-based reinforcement learning to one of two dismal fates: powerful Bayes-adaptive planning, but only for simplistic models; or powerful Bayesian non-parametric models, but with simple, myopic planning strategies such as Thompson sampling. We ask whether it is feasible, and truly beneficial, to combine rich probabilistic models with a closer approximation to fully Bayesian planning. First, we use a collection of counterexamples to show formal problems with the over-optimism inherent in Thompson sampling. Then we leverage state-of-the-art techniques in efficient Bayes-adaptive planning and non-parametric Bayesian methods to perform qualitatively better than both existing conventional algorithms and Thompson sampling on two contextual bandit-like problems.
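For context on the baseline being critiqued, here is a minimal sketch of Thompson sampling for a Beta-Bernoulli multi-armed bandit. This is an illustration of the standard algorithm, not of the paper's counterexamples; the arm probabilities, horizon, and function names are illustrative choices.

```python
import random

def thompson_sampling(true_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling on a K-armed bandit.

    Each step: draw one success-probability sample per arm from its
    Beta(alpha, beta) posterior, pull the arm with the largest sample,
    then update that arm's posterior with the observed reward. The
    agent acts greedily with respect to a single posterior sample,
    which is the myopic optimism the abstract refers to.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # Beta(1, 1) uniform prior for each arm
    beta = [1.0] * k
    total_reward = 0
    for _ in range(horizon):
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        total_reward += reward
        alpha[arm] += reward       # success count
        beta[arm] += 1 - reward    # failure count
    return total_reward, alpha, beta

total, alpha, beta = thompson_sampling([0.3, 0.7], horizon=500)
```

In contrast, a Bayes-adaptive planner would evaluate actions by their long-run value over the full posterior, including the information they yield, rather than committing to a single sampled model at each step.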
