Understanding the Effects of Dataset Characteristics on Offline Reinforcement Learning

In real-world settings, acting in the environment with a weak policy can be expensive or risky, which hampers many applications of reinforcement learning. Offline Reinforcement Learning (RL) learns policies from a fixed dataset without interacting with the environment. However, the dataset is the only source of information for an Offline RL algorithm and thus determines the performance of the learned policy. Studies on how dataset characteristics influence different Offline RL algorithms are still lacking. We therefore conducted a comprehensive empirical analysis of how dataset characteristics affect the performance of Offline RL algorithms in discrete action environments. A dataset is characterized by two metrics: (1) the average dataset return, measured by the Trajectory Quality (TQ), and (2) the coverage, measured by the State-Action Coverage (SACo). We found that variants of the off-policy Deep Q-Network family require datasets with high SACo to perform well. Algorithms that constrain the learned policy towards the given dataset perform well on datasets with high TQ or SACo. For datasets with high TQ, Behavior Cloning outperforms or performs on par with the best Offline RL algorithms.
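
The two metrics can be illustrated with a minimal Python sketch. It assumes that TQ normalizes the dataset's average return between a random and an expert reference return, and that SACo is the ratio of unique state-action pairs in the dataset to those in a reference dataset; the function names, normalization scheme, and reference values below are illustrative assumptions, not the exact definitions from the paper.

```python
# Hedged sketch of the two dataset metrics named in the abstract.
# The normalization constants and the reference dataset are assumptions.
import numpy as np


def trajectory_quality(dataset_returns, random_return, expert_return):
    """Average dataset return, normalized between a random and an expert
    reference return (assumed min-max style normalization)."""
    mean_return = float(np.mean(dataset_returns))
    return (mean_return - random_return) / (expert_return - random_return)


def state_action_coverage(dataset_pairs, reference_pairs):
    """Unique (state, action) pairs in the dataset relative to a reference
    dataset (assumed ratio of unique-pair counts). Pairs must be hashable,
    e.g. (tuple(state), action)."""
    return len(set(dataset_pairs)) / len(set(reference_pairs))


if __name__ == "__main__":
    # Toy data: per-trajectory returns of the dataset and reference returns.
    returns = [12.0, 15.5, 9.0]
    print(trajectory_quality(returns, random_return=0.0, expert_return=20.0))

    # Toy (state, action) pairs for the dataset and a reference dataset.
    data = [((0, 1), 0), ((0, 1), 1), ((2, 3), 0)]
    ref = [((0, 1), 0), ((0, 1), 1), ((2, 3), 0), ((4, 5), 1)]
    print(state_action_coverage(data, ref))
```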
