REALab: An Embedded Perspective on Tampering

This paper describes REALab, a platform for embedded agency research in reinforcement learning (RL). REALab is designed to model the structure of tampering problems that may arise in real-world deployments of RL. Standard Markov Decision Process (MDP) formulations of RL and simulated environments mirroring the MDP structure assume secure access to feedback (e.g., rewards). This may be unrealistic in settings where agents are embedded and can corrupt the processes producing feedback (e.g., human supervisors, or an implemented reward function). We describe an alternative Corrupt Feedback MDP formulation and the REALab environment platform, which both avoid the secure feedback assumption. We hope the design of REALab provides a useful perspective on tampering problems, and that the platform may serve as a unit test for the presence of tampering incentives in RL agent designs.
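As a minimal sketch of the distinction (the notation here is ours, for illustration, and not necessarily the paper's own): in a standard MDP the agent receives the true reward $r_t = R(s_t, a_t)$ as feedback, whereas in a Corrupt Feedback MDP it only observes a possibly corrupted signal

$$\tilde{r}_t = C(s_t, R(s_t, a_t)),$$

where the corruption function $C$ can depend on parts of the state that the agent influences. An agent that maximizes observed feedback therefore has an incentive to reach states in which $C$ inflates $\tilde{r}_t$, i.e., to tamper with the feedback process rather than to solve the intended task.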
