Critic Regularized Regression

Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses the challenges of data-collection cost and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm that learns policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces, outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.
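At its core, CRR trains the policy by behavioral cloning on dataset actions, reweighted by a function of a critic-estimated advantage, so the policy imitates only those actions the critic judges to be at least as good as what the current policy would do. The sketch below illustrates this idea in PyTorch under stated assumptions: `policy(states)` is assumed to return a `torch.distributions` object, `critic(states, actions)` to return Q-values, and the clipping constant is an illustrative choice rather than a value from the paper.

```python
import torch

def crr_policy_loss(policy, critic, states, actions,
                    n_samples=4, beta=1.0, mode="binary"):
    """Critic-regularized regression policy loss (illustrative sketch).

    Weighted behavioral cloning: the log-likelihood of dataset actions,
    weighted by a function f of the critic's advantage estimate.
    `policy` and `critic` are assumptions about the surrounding
    codebase, not APIs defined by the paper.
    """
    dist = policy(states)                      # pi(. | s)
    log_prob = dist.log_prob(actions)          # log pi(a | s) for dataset actions

    with torch.no_grad():
        q_data = critic(states, actions)       # Q(s, a) for dataset actions
        # Monte Carlo value estimate: mean Q over actions sampled from pi
        sampled = dist.sample((n_samples,))    # [n_samples, batch, action_dim]
        q_pi = torch.stack([critic(states, a) for a in sampled]).mean(0)
        advantage = q_data - q_pi              # advantage estimate A(s, a)

        if mode == "binary":
            # clone only actions with positive estimated advantage
            weight = (advantage > 0).float()
        else:
            # softer exponentiated-advantage weighting, clipped for
            # stability (clip value is an assumed, illustrative choice)
            weight = torch.clamp(torch.exp(advantage / beta), max=20.0)

    return -(weight * log_prob).mean()
```

The `binary` mode corresponds to an indicator on positive advantage, while the exponential variant softens this into advantage-weighted regression; in both cases the advantage is estimated by Monte Carlo, comparing the critic's value of the dataset action against its mean value over actions sampled from the current policy.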
