Muesli: Combining Improvements in Policy Optimization

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.
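To make the combination concrete, the following is a minimal sketch, assuming a discrete action space, of one way to pair a regularized policy-optimization term with a model-learning auxiliary loss: a clipped-advantage policy-gradient term, a KL regularizer toward a prior policy, and a one-step reward-prediction error as the auxiliary loss. The function name, coefficients, and the choice of auxiliary target are illustrative assumptions, not the paper's exact objective.

    # Illustrative sketch only; not the paper's exact objective.
    import torch
    import torch.nn.functional as F
    from torch.distributions import Categorical, kl_divergence

    def regularized_policy_loss(policy_logits, prior_logits, actions, advantages,
                                predicted_reward, observed_reward,
                                kl_coef=1.0, aux_coef=1.0):
        # Current policy and a "prior" policy (e.g. a target network) as categoricals.
        pi = Categorical(logits=policy_logits)
        prior = Categorical(logits=prior_logits)

        # Policy-gradient term weighted by clipped advantages; acts directly
        # with the policy network, no search involved.
        pg_loss = -(advantages.clamp(-1.0, 1.0) * pi.log_prob(actions)).mean()

        # Regularizer penalizing deviation of the policy from the prior.
        reg_loss = kl_divergence(prior, pi).mean()

        # Auxiliary model-learning loss: here a one-step reward prediction error.
        model_loss = F.mse_loss(predicted_reward, observed_reward)

        return pg_loss + kl_coef * reg_loss + aux_coef * model_loss

The auxiliary term shapes the shared representation through model learning, while the KL term keeps each policy update conservative; both ideas appear in the update described above, though the exact weighting and targets used by Muesli differ from this sketch.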
