Self-Tuning Deep Reinforcement Learning

Reinforcement learning (RL) algorithms often require expensive manual or automated hyperparameter searches in order to perform well on a new domain. This need is particularly acute in modern deep RL architectures, which often incorporate many modules and multiple loss functions. In this paper, we take a step towards addressing this issue by using metagradients (Xu et al., 2018) to tune these hyperparameters via differentiable cross-validation, whilst the agent interacts with and learns from the environment. We present the Self-Tuning Actor-Critic (STAC), which uses this process to tune the hyperparameters of the usual loss function of the IMPALA actor-critic agent (Espeholt et al., 2018), to learn the hyperparameters that define auxiliary loss functions, and to balance trade-offs in off-policy learning by introducing and adapting the hyperparameters of a novel leaky V-trace operator. The method is simple to use, sample efficient, and does not require a significant increase in compute. Ablation studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to 57 games in the Atari 2600 environment over 200 million frames, our algorithm improves the median human-normalized score of the baseline from 243% to 364%.
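As a rough illustration of the metagradient mechanism cited above (Xu et al., 2018), the sketch below updates a set of differentiable hyperparameters by backpropagating an outer ("validation") loss through one inner update of the agent parameters. This is a minimal sketch, not the authors' implementation: the toy value-regression losses, the sigmoid parameterization of the discount and trace-decay logits, and all names (`inner_loss`, `outer_loss`, `meta_step`) are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def inner_loss(theta, eta, batch):
    # Placeholder agent loss; in STAC this would be the IMPALA loss whose
    # hyperparameters (discount, trace decay, loss weights, ...) come from eta.
    gamma = jax.nn.sigmoid(eta["gamma_logit"])
    lam = jax.nn.sigmoid(eta["lambda_logit"])
    returns = batch["rewards"] + gamma * lam * batch["bootstrap"]
    values = batch["obs"] @ theta
    return jnp.mean((returns - values) ** 2)

def outer_loss(theta, batch):
    # Outer loss with fixed reference hyperparameters, playing the role of the
    # differentiable cross-validation signal in the abstract.
    returns = batch["rewards"] + 0.99 * batch["bootstrap"]
    values = batch["obs"] @ theta
    return jnp.mean((returns - values) ** 2)

def meta_step(theta, eta, batch, lr=1e-2, meta_lr=1e-3):
    # Inner update of theta under the current hyperparameters eta.
    def updated_theta(eta):
        grads = jax.grad(inner_loss)(theta, eta, batch)
        return theta - lr * grads

    new_theta = updated_theta(eta)
    # Metagradient: differentiate the outer loss through the inner update.
    meta_grads = jax.grad(lambda e: outer_loss(updated_theta(e), batch))(eta)
    new_eta = {k: eta[k] - meta_lr * meta_grads[k] for k in eta}
    return new_theta, new_eta

# Toy usage with random data.
key = jax.random.PRNGKey(0)
theta = jnp.zeros(4)
eta = {"gamma_logit": jnp.array(2.0), "lambda_logit": jnp.array(2.0)}
batch = {
    "obs": jax.random.normal(key, (32, 4)),
    "rewards": jnp.ones(32),
    "bootstrap": jnp.ones(32),
}
theta, eta = meta_step(theta, eta, batch)
```

Squashing the hyperparameters through a sigmoid keeps them in a valid range while leaving the logits free for unconstrained gradient updates; the same pattern extends to loss weights and, in the paper's setting, the parameters of the leaky V-trace operator.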

[1] Nan Jiang et al. The Dependence of Effective Planning Horizon on Model Accuracy, 2015, AAMAS.

[2] Satinder Singh et al. On Learning Intrinsic Rewards for Policy Gradient Methods, 2018, NeurIPS.

[3] Ryan P. Adams et al. Gradient-based Hyperparameter Optimization through Reversible Learning, 2015, ICML.

[4] Tom Schaul et al. Adapting Behaviour for Learning Progress, 2019, ArXiv.

[5] Tom Schaul et al. Reinforcement Learning with Unsupervised Auxiliary Tasks, 2016, ICLR.

[6] Fabian Pedregosa et al. Hyperparameter optimization with approximate gradient, 2016, ICML.

[7] Will Dabney et al. Adaptive Trade-Offs in Off-Policy Learning, 2020, AISTATS.

[8] Max Jaderberg et al. Population Based Training of Neural Networks, 2017, ArXiv.

[9] Yoshua Bengio et al. Hyperbolic Discounting and Learning over Multiple Horizons, 2019, ArXiv.

[10] David Silver et al. Meta-Gradient Reinforcement Learning, 2018, NeurIPS.

[11] Demis Hassabis et al. Mastering Atari, Go, chess and shogi by planning with a learned model, 2019, Nature.

[12] Jasper Snoek et al. Practical Bayesian Optimization of Machine Learning Algorithms, 2012, NIPS.

[13] Martha White et al. A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning, 2016, AAMAS.

[14] Marc G. Bellemare et al. Safe and Efficient Off-Policy Reinforcement Learning, 2016, NIPS.

[15] Shane Legg et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018, ICML.

[16] Shie Mannor et al. Adaptive Lambda Least-Squares Temporal Difference Learning, 2016, arXiv:1612.09465.

[17] Paolo Frasconi et al. Forward and Reverse Gradient-Based Hyperparameter Optimization, 2017, ICML.

[18] Tianqi Chen et al. Empirical Evaluation of Rectified Activations in Convolutional Network, 2015, ArXiv.

[19] Richard L. Lewis et al. Discovery of Useful Questions as Auxiliary Tasks, 2019, NeurIPS.

[20] Matthew E. Taylor et al. Metatrace: Online Step-size Tuning by Meta-gradient Descent for Reinforcement Learning Control, 2018, ArXiv.

[21] Karen Simonyan et al. Off-Policy Actor-Critic with Shared Experience Replay, 2020, ICML.

[22] Marc G. Bellemare et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[23] Richard S. Sutton. Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta, 1992, AAAI.

[24] Andrew L. Maas. Rectifier Nonlinearities Improve Neural Network Acoustic Models, 2013.