Scalable agent alignment via reward modeling: a research direction

One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.

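To make the reward-modeling loop concrete, here is a minimal sketch in plain Python/NumPy under strong simplifying assumptions: a toy one-step environment with one-hot state features, a hidden "true" reward standing in for the user's intentions, a linear reward model fit from simulated pairwise preferences (Bradley-Terry style), and a softmax policy trained with REINFORCE against the learned reward. The environment, helper names, and hyperparameters are all illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the reward-modeling loop, using NumPy only.
# Everything here (the toy environment, the linear reward model, the softmax
# policy) is an illustrative assumption, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, DIM = 8, 4, 8
true_w = rng.normal(size=(N_ACTIONS, DIM))   # hidden "user intention"
states = np.eye(N_STATES, DIM)               # one-hot state features

def true_reward(s, a):
    return states[s] @ true_w[a]

# --- Reward model: one linear head per action, fit from pairwise preferences ---
model_w = np.zeros((N_ACTIONS, DIM))

def learned_reward(s, a):
    return states[s] @ model_w[a]

def update_reward_model(prefs, lr=0.5):
    # prefs: list of ((s1, a1), (s2, a2)) where the user prefers the first pair.
    for (s1, a1), (s2, a2) in prefs:
        # Bradley-Terry: P(first preferred) = sigmoid(r1 - r2)
        p = 1.0 / (1.0 + np.exp(-(learned_reward(s1, a1) - learned_reward(s2, a2))))
        g = 1.0 - p  # gradient of the preference log-likelihood
        model_w[a1] += lr * g * states[s1]
        model_w[a2] -= lr * g * states[s2]

# --- Policy: softmax over actions per state, trained with REINFORCE ---
policy_logits = np.zeros((N_STATES, N_ACTIONS))

def act(s):
    z = policy_logits[s] - policy_logits[s].max()
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(N_ACTIONS, p=p), p

def update_policy(s, a, p, r, lr=0.1):
    # REINFORCE for a one-step episode: grad log pi(a|s) = one_hot(a) - p
    grad = -p
    grad[a] += 1.0
    policy_logits[s] += lr * r * grad

# --- Interleave: collect user preferences, fit the reward model, run RL ---
for epoch in range(200):
    # 1. Query the user: compare two random (state, action) pairs.
    pairs = [((rng.integers(N_STATES), rng.integers(N_ACTIONS)),
              (rng.integers(N_STATES), rng.integers(N_ACTIONS))) for _ in range(16)]
    prefs = [(x, y) if true_reward(*x) >= true_reward(*y) else (y, x) for x, y in pairs]
    update_reward_model(prefs)

    # 2. Optimize the policy against the *learned* reward, not the true one.
    for _ in range(32):
        s = rng.integers(N_STATES)
        a, p = act(s)
        update_policy(s, a, p, learned_reward(s, a))

# Evaluate the trained policy on the hidden true reward.
avg = np.mean([true_reward(s, act(s)[0]) for s in range(N_STATES) for _ in range(50)])
print(f"average true reward of trained policy: {avg:.3f}")
```

The structural point the sketch tries to capture is that the policy only ever optimizes the learned reward; the true reward stands in for the user and is used solely to generate preference feedback and to evaluate alignment at the end.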