Causal Analysis of Agent Behavior for AI Safety

As machine learning systems become more powerful, they also become increasingly unpredictable and opaque. Yet finding human-understandable explanations of how they work is essential for their safe deployment. This technical report illustrates a methodology for investigating the causal mechanisms that drive the behaviour of artificial agents. Six use cases are covered, each addressing a typical question an analyst might ask about an agent. In particular, we show that none of these questions can be answered by observation alone; each requires experiments with systematically chosen manipulations that generate the correct causal evidence.
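
The gap between observing and intervening can be illustrated with a minimal sketch. The model below is a hypothetical toy, not one of the report's six use cases: a hidden cue Z influences both a feature X that the analyst records and the agent's action Y, while X itself has no effect on Y. All names and probabilities are assumptions chosen for illustration.

```python
# Hypothetical toy causal model (an assumption for illustration, not from the report):
#   Z -> X  and  Z -> Y, with no arrow from X to Y.
P_Z = {0: 0.5, 1: 0.5}                                      # prior over hidden cue Z
P_X_GIVEN_Z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}    # indexed [z][x]
P_Y_GIVEN_Z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}    # indexed [z][y]


def observational(y, x):
    """P(Y=y | X=x): condition on a passively observed value of X."""
    joint = sum(P_Z[z] * P_X_GIVEN_Z[z][x] * P_Y_GIVEN_Z[z][y] for z in P_Z)
    marginal = sum(P_Z[z] * P_X_GIVEN_Z[z][x] for z in P_Z)
    return joint / marginal


def interventional(y, x):
    """P(Y=y | do(X=x)): forcing X removes the Z -> X arrow, so Z keeps its prior.

    Because X has no causal effect on Y in this toy model, the result
    does not depend on x.
    """
    return sum(P_Z[z] * P_Y_GIVEN_Z[z][y] for z in P_Z)


if __name__ == "__main__":
    for x in (0, 1):
        print(f"x={x}: observed P(Y=1|X=x)={observational(1, x):.2f}, "
              f"intervened P(Y=1|do(X=x))={interventional(1, x):.2f}")
```

In this sketch the observed data suggest that X strongly predicts the action, yet manipulating X directly leaves the action distribution unchanged. Only the experiment reveals that the dependence comes from the shared cue Z.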
