Avoiding Tampering Incentives in Deep RL via Decoupled Approval

How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent? Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper with the reward-generating mechanism. We present a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoupled feedback collection procedure. For a natural class of corruption functions, decoupled approval algorithms have aligned incentives both at convergence and for their local updates. Empirically, they also scale to complex 3D environments where tampering is possible.
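As a minimal, hypothetical sketch of the decoupling idea (not the paper's actual algorithms or 3D environments), the toy Python snippet below contrasts coupled feedback, where the overseer rates the action that was actually executed, with decoupled feedback, where the overseer rates an independently sampled query action. The single-state setup, the approval values, and the `give_feedback` corruption model are illustrative assumptions: executing a designated tamper action corrupts the channel so that whatever approval is reported comes back maximal.

```
import numpy as np

rng = np.random.default_rng(0)

# Toy one-state problem (illustrative only).
# Actions: 0 = good task action, 1 = mediocre task action,
#          2 = "tamper" action that corrupts the feedback channel.
TRUE_APPROVAL = np.array([0.8, 0.3, 0.0])  # what an uncorrupted overseer reports
TAMPER_ACTION = 2

def give_feedback(executed, queried):
    """Overseer's approval of `queried`, possibly corrupted by `executed`."""
    if executed == TAMPER_ACTION:
        return 1.0  # corrupted channel reports maximal approval regardless of query
    return TRUE_APPROVAL[queried]

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train(decoupled, steps=20_000, lr=0.05):
    logits = np.zeros(3)
    for _ in range(steps):
        pi = softmax(logits)
        executed = rng.choice(3, p=pi)  # action applied to the environment
        # Decoupled: query action is drawn independently of the executed action.
        queried = rng.choice(3, p=pi) if decoupled else executed
        approval = give_feedback(executed, queried)
        # REINFORCE-style update on the *queried* action.
        grad = -pi
        grad[queried] += 1.0
        logits += lr * approval * grad
    return softmax(logits)

print("coupled  :", np.round(train(decoupled=False), 3))  # mass drifts toward tampering
print("decoupled:", np.round(train(decoupled=True), 3))   # mass concentrates on action 0
```

With coupled feedback, the tamper action rates its own execution at 1.0 and so outcompetes the honest actions. With decoupled feedback, the corruption caused by executing the tamper action applies to a query action sampled independently of it, so in expectation it contributes a zero-mean score-function term and cannot selectively favor tampering; this is a toy analogue of why decoupled approval updates stay aligned for the class of corruption functions the paper considers.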
