The Phenomenon of Policy Churn

We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations for why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts ε-greedy exploration in a fresh light, namely that ε-noise plays a much smaller role than expected.
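
The quantity being measured is straightforward to state. As a rough illustration (not the paper's own instrumentation), the sketch below estimates churn as the fraction of a fixed probe set of states whose greedy action differs between two snapshots of the Q-values, e.g. before and after a handful of learning updates; the probe-set size, action count, and the random perturbation standing in for an update are arbitrary choices for this example.

```python
import numpy as np

def policy_churn(q_before: np.ndarray, q_after: np.ndarray) -> float:
    """Fraction of probe states whose greedy action changed.

    q_before, q_after: arrays of shape [num_states, num_actions] holding
    action-value estimates for the same fixed set of probe states,
    evaluated before and after one (or a few) learning updates.
    """
    greedy_before = q_before.argmax(axis=1)
    greedy_after = q_after.argmax(axis=1)
    return float(np.mean(greedy_before != greedy_after))

# Toy usage: random Q-values for 1000 probe states and 18 actions
# (the full Atari action set), slightly perturbed to mimic an update.
rng = np.random.default_rng(0)
q0 = rng.normal(size=(1000, 18))
q1 = q0 + 0.1 * rng.normal(size=q0.shape)
print(f"churn after one simulated update: {policy_churn(q0, q1):.2%}")
```

In an actual agent, q_before and q_after would come from evaluating the Q-network on the same batch of states at two nearby points in training; the empirical finding is that this fraction is surprisingly large even over very few updates.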
