The Alignment Problem for History-Based Bayesian Reinforcement Learners

Future artificial intelligences may be many times smarter than humans (Bostrom, 2014). If humans are to have any chance of controlling such systems, the systems' goals must be aligned with human goals. Unfortunately, the goals of RL agents as designed today are heavily misaligned with human values for a number of reasons. In this paper, we categorize sources of misalignment and give examples of each type. We also describe a range of tools for managing misalignment. Combined, the tools yield a number of aligned AI designs, though much future work remains to assess their practical feasibility.
