Decoupled Exploration and Exploitation Policies for Sample-Efficient Reinforcement Learning

Despite the close connection between exploration and sample efficiency, most state-of-the-art reinforcement learning algorithms include no mechanism for exploration beyond maximizing the entropy of the policy. In this work we address this apparent missed opportunity. We observe that the most common formulation of directed exploration in deep RL, known as bonus-based exploration (BBE), suffers from bias and slow coverage in the few-sample regime. This causes BBE to be actively detrimental to policy learning in many control tasks. We show that by decoupling the task policy from the exploration policy, directed exploration can be highly effective for sample-efficient continuous control. Our method, Decoupled Exploration and Exploitation Policies (DEEP), can be combined with any off-policy RL algorithm without modification. When used in conjunction with soft actor-critic, DEEP incurs no performance penalty in dense-reward environments. In sparse-reward environments, DEEP gives a several-fold improvement in data efficiency due to better exploration.
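
The abstract does not spell out the algorithm, but the core idea it describes (an exploration policy that gathers data while a separate task policy is trained off-policy on the same replay buffer) can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the `env`, `task_agent`, and `explore_agent` interfaces, the classic Gym-style step/reset signature, and the count-based novelty bonus are all assumptions made for the sketch; in practice both agents could be any off-policy learner such as SAC.

```python
# Illustrative sketch of decoupled exploration/exploitation training.
# Not the paper's implementation; interfaces below are assumed for the example.
import random
from collections import defaultdict, deque

import numpy as np


class ReplayBuffer:
    """Shared buffer: both policies draw training batches from the same data."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))


def novelty_bonus(state, visit_counts, bin_size=0.5):
    """Toy count-based bonus standing in for any novelty/intrinsic-reward estimate."""
    key = tuple(np.floor(np.asarray(state) / bin_size).astype(int))
    visit_counts[key] += 1
    return 1.0 / np.sqrt(visit_counts[key])


def train_decoupled(env, task_agent, explore_agent, steps=10_000, batch_size=256):
    """Training-loop sketch: only the exploration agent acts in the environment and
    is trained on the novelty reward; the task agent never collects data and is
    trained purely off-policy on the extrinsic reward from the shared buffer.

    Assumes a classic Gym-style env (reset() -> state, step(a) -> (state, r, done, info))
    and agents exposing act(state) and update(batch)."""
    buffer = ReplayBuffer()
    counts = defaultdict(int)
    state = env.reset()

    for _ in range(steps):
        # Data collection is driven by the exploration policy only.
        action = explore_agent.act(state)
        next_state, extrinsic_reward, done, _ = env.step(action)
        bonus = novelty_bonus(next_state, counts)
        buffer.add((state, action, extrinsic_reward, bonus, next_state, done))
        state = env.reset() if done else next_state

        batch = buffer.sample(batch_size)
        if batch:
            # Same transitions, different reward signals for the two policies.
            task_agent.update([(s, a, r_ext, s2, d) for s, a, r_ext, _, s2, d in batch])
            explore_agent.update([(s, a, r_nov, s2, d) for s, a, _, r_nov, s2, d in batch])

    return task_agent
```

Because both policies consume the same replay data, the task policy benefits from the exploration policy's coverage without its objective ever being biased by the exploration bonus, which is the decoupling the abstract argues avoids the pitfalls of bonus-based exploration in the few-sample regime.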
