Thompson Sampling is Asymptotically Optimal in General Environments

We discuss a variant of Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markov, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges to the optimal value in mean and (2) given a recoverability assumption, regret is sublinear.

Laurent Orseau | Tor Lattimore | Marcus Hutter | Jan Leike
