Thompson Sampling is Asymptotically Optimal in General Environments

We discuss a variant of Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markov, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) its value asymptotically converges in mean to the optimal value, and (2) given a recoverability assumption, its regret is sublinear.
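As a rough illustration of the basic idea only, the following minimal sketch runs Thompson sampling over a small finite class of candidate environments. It is a toy under stated assumptions, not the variant analyzed in the paper: the environments here are two-armed Bernoulli bandits rather than general non-Markov, partially observable environments, the posterior is resampled at every step, and the names `thompson_sampling`, `candidates`, `true_env`, and `prior` are hypothetical.

```python
import random


def thompson_sampling(candidates, true_env, prior, steps, seed=0):
    """Toy Thompson sampling over a finite class of Bernoulli bandits.

    candidates: list of environments, each a list of per-arm success
                probabilities strictly between 0 and 1 (an assumption
                that keeps every Bayes update well defined)
    true_env:   index of the candidate that actually generates rewards
    prior:      positive prior weights over the candidates
    Returns the final posterior and the total reward collected.
    """
    rng = random.Random(seed)
    posterior = list(prior)
    total_reward = 0
    for _ in range(steps):
        # Sample one environment from the current posterior.
        sampled = rng.choices(range(len(candidates)), weights=posterior)[0]
        # Act optimally for the sampled environment (pull its best arm).
        probs = candidates[sampled]
        arm = max(range(len(probs)), key=probs.__getitem__)
        # Observe a reward generated by the true environment.
        reward = 1 if rng.random() < candidates[true_env][arm] else 0
        total_reward += reward
        # Bayes update: reweight each candidate by the likelihood of the
        # observed reward under that candidate, then renormalize.
        unnormalized = [
            w * (env[arm] if reward else 1.0 - env[arm])
            for w, env in zip(posterior, candidates)
        ]
        z = sum(unnormalized)
        posterior = [u / z for u in unnormalized]
    return posterior, total_reward


if __name__ == "__main__":
    envs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
    post, reward = thompson_sampling(envs, true_env=1, prior=[1 / 3] * 3, steps=500)
    print(post, reward)
```

In this toy class every action distinguishes the true environment from the alternatives, so the posterior concentrates quickly. The general setting of the abstract offers no such guarantee, which is why the sublinear regret result needs the recoverability assumption.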
