Muesli: Combining Improvements in Policy Optimization

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.
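To make the combination concrete, the following is a minimal sketch, assuming a discrete action space, of one way to pair a regularized policy-optimization term with a model-learning auxiliary loss: a clipped-advantage policy-gradient term, a KL regularizer toward a prior policy, and a one-step reward-prediction error as the auxiliary loss. The function name, coefficients, and the choice of auxiliary target are illustrative assumptions, not the paper's exact objective.

    # Illustrative sketch only; not the paper's exact objective.
    import torch
    import torch.nn.functional as F
    from torch.distributions import Categorical, kl_divergence

    def regularized_policy_loss(policy_logits, prior_logits, actions, advantages,
                                predicted_reward, observed_reward,
                                kl_coef=1.0, aux_coef=1.0):
        # Current policy and a "prior" policy (e.g. a target network) as categoricals.
        pi = Categorical(logits=policy_logits)
        prior = Categorical(logits=prior_logits)

        # Policy-gradient term weighted by clipped advantages; acts directly
        # with the policy network, no search involved.
        pg_loss = -(advantages.clamp(-1.0, 1.0) * pi.log_prob(actions)).mean()

        # Regularizer penalizing deviation of the policy from the prior.
        reg_loss = kl_divergence(prior, pi).mean()

        # Auxiliary model-learning loss: here a one-step reward prediction error.
        model_loss = F.mse_loss(predicted_reward, observed_reward)

        return pg_loss + kl_coef * reg_loss + aux_coef * model_loss

The auxiliary term shapes the shared representation through model learning, while the KL term keeps each policy update conservative; both ideas appear in the update described above, though the exact weighting and targets used by Muesli differ from this sketch.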
