Offline Learning from Demonstrations and Unlabeled Experience

Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline, without rewards, via supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality that carries no reward annotations. Such data can be generated by a variety of sources, including human teleoperation, scripted policies, and other agents operating on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
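The abstract's three-stage description (learn a reward by contrasting demonstrator and unlabeled observations, relabel all data, then run offline RL) can be made concrete with a short sketch of the first stage. The following is a minimal illustration, assuming a PyTorch setup; `RewardNet`, `pu_reward_loss`, the hidden sizes, and the class prior `eta` are illustrative names and choices, not the authors' exact implementation.

```python
# Sketch of ORIL's reward-learning stage: a discriminator trained with a
# positive-unlabeled (PU) objective, treating demonstrator observations as
# positives and unlabeled experience as a positive/negative mixture.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps an observation to a scalar in (0, 1), used as the learned reward."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def pu_reward_loss(reward_net: RewardNet,
                   demo_obs: torch.Tensor,
                   unlabeled_obs: torch.Tensor,
                   eta: float = 0.5) -> torch.Tensor:
    """PU discriminator loss with a non-negative risk correction.

    `eta` is the assumed fraction of success ("positive") states hidden
    inside the unlabeled data (a hyperparameter, since unlabeled
    trajectories are of mixed and unknown quality).
    """
    eps = 1e-6
    r_demo = reward_net(demo_obs)          # scores for demonstrator states
    r_unlab = reward_net(unlabeled_obs)    # scores for unlabeled states

    # Positive risk: demonstrator states should score near 1.
    pos_risk = -torch.log(r_demo.clamp(min=eps)).mean()
    # Naive negative risk on unlabeled states, corrected by subtracting the
    # contribution of the positives believed to be mixed into them.
    neg_risk = (-torch.log((1.0 - r_unlab).clamp(min=eps)).mean()
                + eta * torch.log((1.0 - r_demo).clamp(min=eps)).mean())
    # Clamping the corrected negative risk at zero keeps the estimator
    # non-negative and curbs discriminator overfitting.
    return eta * pos_risk + torch.clamp(neg_risk, min=0.0)

# Stages 2 and 3 (not shown): relabel every stored transition with
# reward_net(obs), then train a policy on the relabeled dataset with any
# offline RL algorithm.
if __name__ == "__main__":
    net = RewardNet(obs_dim=8)
    opt = torch.optim.Adam(net.parameters(), lr=3e-4)
    demo = torch.randn(64, 8)       # stand-in demonstrator batch
    unlab = torch.randn(64, 8)      # stand-in unlabeled batch
    loss = pu_reward_loss(net, demo, unlab)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once the reward model is trained, every transition in the dataset receives a predicted reward, and the relabeled data can be fed to an off-the-shelf offline RL learner. The abstract leaves the specific components open, so the sigmoid discriminator and PU prior above should be read as one plausible instantiation rather than the definitive recipe.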
