A Monte-Carlo AIXI Approximation

This paper introduces a principled approach to the design of a scalable general reinforcement learning agent. Our approach is based on a direct approximation of AIXI, a Bayesian optimality notion for general reinforcement learning agents. Previously, it had been unclear whether the theory of AIXI could motivate the design of practical algorithms. We answer this open question in the affirmative by providing the first computationally feasible approximation to the AIXI agent. To develop our approximation, we introduce a new Monte-Carlo Tree Search algorithm along with an agent-specific extension to the Context Tree Weighting algorithm. Empirically, we present a set of encouraging results on a variety of stochastic and partially observable domains. We conclude by proposing a number of directions for future research.
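
The abstract names two algorithmic ingredients. The first is an agent-specific extension to Context Tree Weighting (CTW). The abstract does not spell out that extension, so the sketch below shows only the standard binary CTW update it builds on: a Krichevsky-Trofimov (KT) estimator at each context node, mixed half-and-half with the product of its children's predictions. The names (`CTWNode`, `update`) and the representation of contexts as bit lists are illustrative choices, not the paper's implementation.

```python
import math

class CTWNode:
    """One node of a binary context tree. Each node mixes a local KT
    estimator with the joint prediction of its two children (which
    split the context on one more bit of recent history)."""

    def __init__(self, depth, max_depth):
        self.depth = depth
        self.max_depth = max_depth
        self.counts = [0, 0]          # 0s and 1s seen in this context
        self.log_kt = 0.0             # log of the local KT block probability
        self.log_w = 0.0              # log of the weighted (mixture) probability
        self.children = [None, None]  # subtrees selected by the next context bit

    def update(self, bit, context):
        # KT estimator: P(bit | counts) = (counts[bit] + 1/2) / (n + 1)
        n = self.counts[0] + self.counts[1]
        self.log_kt += math.log((self.counts[bit] + 0.5) / (n + 1.0))
        self.counts[bit] += 1

        if self.depth == self.max_depth or not context:
            self.log_w = self.log_kt  # leaf node: nothing to mix with
            return

        child_bit = context[-1]  # most recent context bit picks the child
        if self.children[child_bit] is None:
            self.children[child_bit] = CTWNode(self.depth + 1, self.max_depth)
        self.children[child_bit].update(bit, context[:-1])

        # CTW mixture: P_w = 1/2 * P_kt + 1/2 * P_w(child 0) * P_w(child 1),
        # computed in log space for numerical stability. A missing child
        # contributes log-probability 0 (it has only seen the empty sequence).
        log_split = sum(c.log_w for c in self.children if c is not None)
        hi = max(self.log_kt, log_split)
        self.log_w = hi + math.log(
            0.5 * (math.exp(self.log_kt - hi) + math.exp(log_split - hi)))

# Hypothetical usage: model a bit stream with depth-4 contexts.
root = CTWNode(0, max_depth=4)
bits = [0, 1, 0, 1, 1, 0, 1, 1, 0, 0]
for i in range(4, len(bits)):
    root.update(bits[i], bits[i - 4:i])  # last 4 bits, most recent last
print(math.exp(root.log_w))  # CTW block probability of bits[4:]
```

The second ingredient is a new Monte-Carlo Tree Search algorithm. As rough orientation only, here is the UCB-style action selection at the heart of UCT, the MCTS variant this line of work extends to stochastic, partially observable settings; `SearchNode`, `uct_pick`, and the exploration constant `c` are hypothetical names for illustration, not the paper's algorithm.

```python
import math

class SearchNode:
    """Node of a UCT search tree; the layout is hypothetical."""
    def __init__(self):
        self.visits = 0
        self.total_reward = 0.0
        self.children = []  # one SearchNode per action tried from here

def uct_pick(node, c=1.0):
    """Return the child maximizing mean reward plus the UCB1 bonus
    c * sqrt(ln(parent visits) / child visits)."""
    best, best_score = None, -math.inf
    for child in node.children:
        if child.visits == 0:
            return child  # always try unvisited actions first
        score = (child.total_reward / child.visits
                 + c * math.sqrt(math.log(node.visits) / child.visits))
        if score > best_score:
            best, best_score = child, score
    return best
```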
