Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction

Tree Search (TS) is crucial to some of the most influential successes in reinforcement learning. Here, we tackle two major challenges with TS that limit its usability: distribution shift and scalability. We first discover and analyze a counter-intuitive phenomenon: action selection through TS with a pre-trained value function often yields lower performance than the original pre-trained agent, even with access to the exact state and reward in future steps. We show this is due to a distribution shift to areas where the value estimates are highly inaccurate, and we analyze this effect using Extreme Value Theory. To overcome this problem, we introduce a novel off-policy correction term that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories. We prove that our correction eliminates the above mismatch and we bound the probability of sub-optimal action selection. Our correction significantly improves pre-trained Rainbow agents without any further training, often more than doubling their scores on Atari games. Next, we address the scalability of exhaustive TS, whose computational complexity grows exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first search that advances all nodes at each depth of the tree simultaneously. Batch-BFS reduces runtime by two orders of magnitude and, beyond inference, also enables training with TS at depths that were previously infeasible. We train DQN agents from scratch using TS and show improvement in several Atari games compared to both the original DQN and the more advanced Rainbow.
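To make the correction concrete, below is a minimal Python sketch of how a penalty on under-sampled trajectories could be folded into trajectory scoring during the search. It is an illustrative instantiation rather than the paper's exact formula: the penalty coefficient `lam` and the choice of the per-step greedy gap, max_a Q(s, a) - Q(s, a_taken), as the deviation measure are assumptions made here for clarity.

```python
import numpy as np

def corrected_trajectory_score(rewards, q_path, actions, leaf_value,
                               gamma=0.99, lam=1.0):
    """Score one tree trajectory with an illustrative off-policy penalty.

    rewards    -- rewards r_0..r_{d-1} collected along the path
    q_path     -- (d, num_actions) array of pre-trained Q-values at each
                  node visited along the path
    actions    -- action taken at each of the d nodes
    leaf_value -- pre-trained value estimate at the leaf state
    lam        -- penalty coefficient (hypothetical knob, not from the paper)
    """
    d = len(rewards)
    # Standard TS score: discounted rewards plus bootstrapped leaf value.
    score = sum(gamma ** t * r for t, r in enumerate(rewards))
    score += gamma ** d * leaf_value

    # Penalize deviation from the pre-trained greedy policy: the further a
    # trajectory strays from actions the agent would have taken, the less
    # the value network was trained near it, and the less its leaf estimate
    # can be trusted.
    gaps = q_path.max(axis=1) - q_path[np.arange(d), actions]
    score -= lam * sum(gamma ** t * g for t, g in enumerate(gaps))
    return score
```

At the root, the agent picks the first action of the highest-scoring trajectory; setting lam = 0 recovers plain exhaustive TS with bootstrapped leaves, which is exactly the uncorrected baseline the paper shows can underperform the original agent.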
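The batching pattern behind Batch-BFS can likewise be sketched briefly in PyTorch. This assumes a batched forward model with the hypothetical signature model(states, actions) -> (next_states, rewards) standing in for the GPU emulator, plus a pre-trained q_net; the idea it illustrates is that each depth of the exhaustive tree is expanded in one batched call rather than node by node.

```python
import torch

@torch.no_grad()
def batch_bfs_action(root_state, model, q_net, num_actions, depth, gamma=0.99):
    """Pick an action by exhaustive TS, one batched step per depth (depth >= 1).

    root_state -- (state_dim,) tensor for the current state
    model      -- batched dynamics: (N, state_dim), (N,) actions ->
                  (N, state_dim) next states and (N,) rewards (assumed API)
    q_net      -- pre-trained network: (N, state_dim) -> (N, num_actions)
    """
    frontier = root_state.unsqueeze(0)              # (1, state_dim)
    returns = torch.zeros(1, device=root_state.device)

    for t in range(depth):
        n = frontier.shape[0]
        # Expand every frontier state with every action in a single batch.
        states = frontier.repeat_interleave(num_actions, dim=0)
        actions = torch.arange(num_actions, device=states.device).repeat(n)
        frontier, rewards = model(states, actions)
        returns = returns.repeat_interleave(num_actions) + gamma ** t * rewards

    # Bootstrap all leaves at once with the pre-trained value estimate.
    leaf_values = q_net(frontier).max(dim=1).values
    scores = returns + gamma ** depth * leaf_values

    # Leaves are ordered so the first action is the leading "digit" of the
    # leaf index in base num_actions; recover it from the best leaf.
    return (scores.argmax() // num_actions ** (depth - 1)).item()
```

Because the frontier holds num_actions ** t states at depth t, memory rather than sequential computation becomes the binding constraint, which is the regime where a GPU-wide batch per depth pays off.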
