The Alignment Problem for Bayesian History-Based Reinforcement Learners∗

Value alignment is often considered a critical component of safe artificial intelligence. Meanwhile, reinforcement learning is often criticized as being inherently unsafe and misaligned, for reasons such as wireheading, delusion boxes, misspecified reward functions and distributional shifts. In this report, we categorize sources of misalignment for reinforcement learning agents, illustrating each type with numerous examples. For each type of problem, we also describe ways to remove the source of misalignment. Combined, the suggestions form high-level blueprints for how to design value aligned RL agents.

[1]  Marcus Hutter,et al.  Self-Modification of Policy and Utility Function in Rational Agents , 2016, AGI.

[2]  M. Schervish,et al.  State-Dependent Utilities , 1990 .

[3]  Mark O. Riedl,et al.  Using Stories to Teach Human Values to Artificial Agents , 2016, AAAI Workshop: AI, Ethics, and Society.

[4]  Laurent Orseau,et al.  AI Safety Gridworlds , 2017, ArXiv.

[5]  Laurent Orseau,et al.  Safely Interruptible Agents , 2016, UAI.

[6]  Jürgen Schmidhuber,et al.  Goedel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements , 2003, ArXiv.

[7]  Jon Bird,et al.  The evolved radio and its implications for modelling the evolution of novel sensors , 2002, Proceedings of the 2002 Congress on Evolutionary Computation. CEC'02 (Cat. No.02TH8600).

[8]  Laurent Orseau,et al.  Universal knowledge-seeking agents , 2011, Theor. Comput. Sci..

[9]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[10]  Jürgen Schmidhuber,et al.  World Models , 2018, ArXiv.

[11]  Laurent Orseau,et al.  Delusion, Survival, and Intelligent Agents , 2011, AGI.

[12]  James L Olds,et al.  Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. , 1954, Journal of comparative and physiological psychology.

[13]  G. Smith ANARCHY, STATE, AND UTOPIA , 1976 .

[14]  Stuart Armstrong,et al.  'Indifference' methods for managing agent rewards , 2017, ArXiv.

[15]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[16]  Justin A. Boyan,et al.  Least-Squares Temporal Difference Learning , 1999, ICML.

[17]  Marcus Hutter,et al.  A Causal Influence Diagram Perspective , 2019 .

[18]  Leslie Pack Kaelbling,et al.  Planning and Acting in Partially Observable Stochastic Domains , 1998, Artif. Intell..

[19]  N Wiener,et al.  Some moral and technical consequences of automation , 1960, Science.

[20]  Roman V. Yampolskiy,et al.  Artificial Superintelligence: A Futuristic Approach , 2015 .

[21]  Marcus Hutter,et al.  Avoiding Wireheading with Value Reinforcement Learning , 2016, AGI.

[22]  John Schulman,et al.  Concrete Problems in AI Safety , 2016, ArXiv.

[23]  J. Schreiber Foundations Of Statistics , 2016 .

[24]  Shane Legg,et al.  Deep Reinforcement Learning from Human Preferences , 2017, NIPS.

[25]  Marcus Hutter A Complete Theory of Everything (Will Be Subjective) , 2010, Algorithms.

[26]  M. Strathern ‘Improving ratings’: audit in the British University system , 1997, European Review.

[27]  Pieter Abbeel,et al.  An Application of Reinforcement Learning to Aerobatic Helicopter Flight , 2006, NIPS.

[28]  Risto Miikkulainen,et al.  The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities , 2018, Artificial Life.

[29]  Anca D. Dragan,et al.  Cooperative Inverse Reinforcement Learning , 2016, NIPS.

[30]  C. Robert Superintelligence: Paths, Dangers, Strategies , 2017 .

[31]  David A. Rottenberg,et al.  Compulsive thalamic self-stimulation: A case with metabolic, electrophysiologic and behavioral correlates , 1986, Pain.

[32]  Bill Hibbard,et al.  Model-based Utility Functions , 2011, J. Artif. Gen. Intell..

[33]  Owain Evans,et al.  Trial without Error: Towards Safe Reinforcement Learning via Human Intervention , 2017, AAMAS.

[34]  Demis Hassabis,et al.  Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm , 2017, ArXiv.

[35]  J. Pearl Causality: Models, Reasoning and Inference , 2000 .

[36]  Stephen Omohundro,et al.  The Nature of Self-Improving Artificial Intelligence , 2008 .

[37]  Tor Lattimore,et al.  General time consistent discounting , 2014, Theor. Comput. Sci..

[38]  Marcus Hutter,et al.  Reward tampering problems and solutions in reinforcement learning: a causal influence diagram perspective , 2019, Synthese.