An Optimistic Posterior Sampling Strategy for Bayesian Reinforcement Learning

We consider the problem of decision making in unknown Markov decision processes with finite state and action spaces. Within a Bayesian reinforcement learning framework, we propose an optimistic posterior sampling strategy based on maximizing the state-action value functions of MDPs sampled from the posterior. First experiments are promising.

Introduction. The design of algorithms for planning in unknown Markov Decision Processes (MDPs) remains challenging. One of the main difficulties is the so-called exploration versus exploitation (E/E) dilemma: at every time step, the algorithm must both (i) take a decision of good quality with respect to the information collected so far (the exploitation part) and (ii) collect new information about the (unknown) underlying environment in order to take better decisions in the future (the exploration part). At the end of the eighties, the popularization of Reinforcement Learning (RL) [20] gave a new impulse to the research community working on this old problem, and the E/E dilemma was rediscovered in the light of the RL paradigm. Among the approaches proposed to address the E/E dilemma in RL, one can mention those based on optimism in the face of uncertainty [12, 3, 4, 13, 6, 15] and Bayesian approaches [7, 19, 17, 9, 8]. In the last few years, posterior sampling approaches have received a lot of attention, in particular for solving multi-armed bandit problems [5, 11, 10]. Very recently, posterior sampling has also been shown, both theoretically and empirically, to be efficient for solving MDPs [16].

Our contribution lies at the crossroads between posterior sampling approaches and optimistic approaches. We propose a strategy based on two main assumptions: (i) a posterior distribution can be maintained over the set of all possible transition models, and (ii) one can easily sample and solve MDPs drawn according to this posterior. These two conditions are easily satisfied in the context of finite state and action space MDPs. Inspired by the principle of the Bayes-UCB algorithm proposed in the context of multi-armed bandit problems [10], our strategy works as follows: at each time step, a pool of MDPs is drawn from the posterior distribution and each sampled MDP is solved; we then take an action whose value is maximal over the set of state-action value functions of the sampled MDPs. After observing a new transition, the posterior distribution is updated according to Bayes' rule (a sketch of this strategy is given below). We empirically illustrate the performance of our approach on a standard benchmark.

Model-based Bayesian Reinforcement Learning. Let M = (S, A, T, R) be a Markov Decision Process (MDP), where S = {s_1, ..., s_{n_S}} denotes the finite state space and A = {a_1, ..., a_{n_A}} the finite action space of the MDP, n_S and n_A being the numbers of states and actions. When the MDP is in state s_t ∈ S at time t ∈ N, an action a_t ∈ A is selected and the MDP moves toward a new state s_{t+1} ∈ S, drawn according to the transition model T(· | s_t, a_t).
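Below is a minimal Python sketch of the strategy described above. It relies on assumptions that are standard for finite MDPs but not spelled out in this extract: an independent Dirichlet prior over each transition distribution T(· | s, a) (so that the Bayes update reduces to incrementing transition counts), a known reward function R, and value iteration as the solver for each sampled MDP. All names (OptimisticPosteriorSampling, K, gamma, and so on) are illustrative rather than taken from the paper.

```python
import numpy as np


def value_iteration(T, R, gamma=0.95, n_iter=200):
    """Solve one sampled MDP and return its state-action value function Q.

    T: transition tensor, shape (n_states, n_actions, n_states)
    R: reward array, shape (n_states, n_actions) -- assumed known here
    """
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        V = Q.max(axis=1)            # greedy state values
        Q = R + gamma * T.dot(V)     # Bellman optimality backup
    return Q


class OptimisticPosteriorSampling:
    """Hypothetical sketch: sample K MDPs, solve each, act on the max Q."""

    def __init__(self, n_states, n_actions, R, K=10, gamma=0.95, seed=0):
        self.n_states, self.n_actions = n_states, n_actions
        self.R, self.K, self.gamma = R, K, gamma
        self.rng = np.random.default_rng(seed)
        # Dirichlet(1, ..., 1) prior over each transition distribution T(.|s, a).
        self.alpha = np.ones((n_states, n_actions, n_states))

    def act(self, s):
        # Draw a pool of K transition models from the posterior and solve each MDP.
        q_values = []
        for _ in range(self.K):
            T = np.empty_like(self.alpha)
            for i in range(self.n_states):
                for a in range(self.n_actions):
                    T[i, a] = self.rng.dirichlet(self.alpha[i, a])
            q_values.append(value_iteration(T, self.R, self.gamma))
        # Optimism: act greedily w.r.t. the maximum over the sampled Q-functions.
        q_max = np.max([Q[s] for Q in q_values], axis=0)
        return int(np.argmax(q_max))

    def update(self, s, a, s_next):
        # Bayes' rule for the Dirichlet-multinomial model: increment the count.
        self.alpha[s, a, s_next] += 1.0
```

In a typical experiment loop, the agent calls act(s), executes the chosen action in the environment, and feeds the observed transition back through update(s, a, s_next). Setting K = 1 recovers plain per-step posterior sampling; larger pools make the action selection more optimistic.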

[1] Rémi Munos et al. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. ALT, 2012.

[2] Olivier Buffet et al. Near-Optimal BRL using Optimistic Local Transitions. ICML, 2012.

[3] Csaba Szepesvári et al. Bandit Based Monte-Carlo Planning. ECML, 2006.

[4] Rémi Coulom et al. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. Computers and Games, 2006.

[5] Peter Dayan et al. Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. NIPS, 2012.

[6] Lihong Li et al. An Empirical Evaluation of Thompson Sampling. NIPS, 2011.

[7] Michael Kearns et al. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 1998.

[8] Aurélien Garivier et al. On Bayesian Upper Confidence Bounds for Bandit Problems. AISTATS, 2012.

[9] Richard L. Lewis et al. Variance-Based Rewards for Approximate Bayesian Reinforcement Learning. UAI, 2010.

[10] Benjamin Van Roy et al. (More) Efficient Reinforcement Learning via Posterior Sampling. NIPS, 2013.

[11] Malcolm J. A. Strens et al. A Bayesian Framework for Reinforcement Learning. ICML, 2000.

[12] Stuart J. Russell et al. Bayesian Q-Learning. AAAI/IAAI, 1998.

[13] Ronen I. Brafman et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 2001.

[14] Lihong Li et al. A Bayesian Sampling Approach to Exploration in Reinforcement Learning. UAI, 2009.

[15] Jesse Hoey et al. An analytic solution to discrete Bayesian reinforcement learning. ICML, 2006.

[16] Lucian Busoniu et al. Optimistic planning for belief-augmented Markov Decision Processes. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2013.

[17] Richard S. Sutton et al. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[18] Andrew Y. Ng et al. Near-Bayesian exploration in polynomial time. ICML, 2009.