Thompson Sampling for Learning Parameterized Markov Decision Processes

We consider reinforcement learning in parameterized Markov Decision Processes (MDPs), where the parameterization may induce correlation across transition probabilities or rewards. Consequently, observing a particular state transition may yield useful information about other, unobserved parts of the MDP. We present a version of Thompson sampling for parameterized reinforcement learning problems and derive a frequentist regret bound for priors over general parameter spaces. The result shows that, with high probability, the number of instants at which suboptimal actions are chosen scales logarithmically with time. It holds for prior distributions that put significant probability near the true model, without requiring any additional closed-form structure such as conjugate or product-form priors. The constant factor in the logarithmic scaling encodes the information complexity of learning the MDP in terms of the Kullback-Leibler geometry of the parameter space.
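As a concrete illustration of the scheme described above, the sketch below instantiates posterior sampling for a finite parameter space Theta of candidate (transition, reward) models: at the start of each episode a model is drawn from the current posterior, its optimal policy is followed (here computed by discounted value iteration, purely as a planning shortcut), and every observed transition updates the posterior weight of all candidate models at once. The function names, the episodic loop, and the restriction to a finite Theta are illustrative assumptions, not the paper's general setting or algorithm.

```python
# Minimal, illustrative sketch of posterior (Thompson) sampling for a
# parameterized MDP. Assumes a *finite* parameter space of candidate
# (transition, reward) models; all names below are illustrative.
import numpy as np


def greedy_policy(P, R, gamma=0.95, tol=1e-8):
    """Value iteration on a sampled model: P[s, a, s'] transitions, R[s, a] rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)           # Q[s, a] under the current value estimate
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)       # deterministic greedy policy
        V = V_new


def thompson_sampling_mdp(models, true_idx, n_episodes=200, horizon=50, seed=0):
    """models: list of (P, R) pairs indexing a finite parameter space Theta.
    Interacts with models[true_idx] while maintaining a posterior over Theta."""
    rng = np.random.default_rng(seed)
    log_post = np.zeros(len(models))      # log of a uniform prior over Theta
    P_true, _ = models[true_idx]
    n_states = P_true.shape[0]
    for _ in range(n_episodes):
        # 1. Sample a parameter from the current posterior.
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        k = rng.choice(len(models), p=post)
        # 2. Act with the optimal policy of the sampled model.
        policy = greedy_policy(*models[k])
        s = 0
        for _ in range(horizon):
            a = policy[s]
            s_next = rng.choice(n_states, p=P_true[s, a])
            # 3. Each observed transition reweights *every* candidate model,
            #    which is how correlation across the MDP is exploited.
            for j, (P_j, _) in enumerate(models):
                log_post[j] += np.log(P_j[s, a, s_next] + 1e-12)
            s = s_next
    post = np.exp(log_post - log_post.max())
    return post / post.sum()
```

Because a single transition changes the likelihood of every candidate in Theta, the posterior can discount models even for state-action pairs that were never visited; this information sharing across the parameterization is what the regret bound quantifies through the Kullback-Leibler geometry of the parameter space.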
