When and How to Fool Explainable Models (and Humans) with Adversarial Examples

Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations, chief among them the lack of interpretability and the lack of robustness against adversarial examples and out-of-distribution inputs. In this paper, we explore the possibilities and limits of adversarial attacks on explainable machine learning models. First, we extend the notion of adversarial examples to the setting of explainable machine learning, in which the inputs, the output classifications, and the explanations of the model's decisions are assessed by humans. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment, introducing novel attack paradigms. In particular, our framework considers a wide range of relevant (yet often ignored) factors, such as the type of problem, the user's expertise, and the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). These contributions are intended to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.
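
To make the extended notion concrete, the sketch below (not taken from the paper; the toy model, the choice of explainer, and all hyperparameters are illustrative assumptions) shows one way an attacker could jointly target the prediction and a gradient-based saliency explanation: a PGD-style perturbation that pushes the classifier towards a chosen wrong class while keeping the saliency map close to the one a human would have seen for the clean input.

# Hedged, hypothetical sketch (not the paper's method): a PGD-style attack that
# tries to flip the classifier's decision while keeping a simple input-gradient
# saliency explanation close to the one produced for the clean input, so that a
# human inspecting input, prediction, and explanation is less likely to notice.
# The toy model, loss weights, and step sizes are illustrative assumptions.
import torch
import torch.nn.functional as F


def input_gradient_saliency(model, x, label):
    # Saliency = |d score_label / d x|, kept differentiable (create_graph=True)
    # so the attack can optimize through the explanation itself.
    score = model(x)[0, label]
    (grad,) = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs()


def joint_attack(model, x, true_label, target_label,
                 eps=0.1, step=0.01, iters=50, expl_weight=10.0):
    # Reference explanation a human would see for the clean, correctly handled input.
    x_clean = x.clone().detach().requires_grad_(True)
    expl_clean = input_gradient_saliency(model, x_clean, true_label).detach()

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        x_adv = x + delta
        logits = model(x_adv)
        # (1) Fool the model: make the (wrong) target class likely.
        cls_loss = F.cross_entropy(logits, torch.tensor([target_label]))
        # (2) Fool the human: the explanation of the new prediction should
        #     resemble the explanation of the original, correct prediction.
        expl_adv = input_gradient_saliency(model, x_adv, target_label)
        expl_loss = F.mse_loss(expl_adv, expl_clean)
        loss = cls_loss + expl_weight * expl_loss

        (grad_delta,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= step * grad_delta.sign()
            delta.clamp_(-eps, eps)  # keep the perturbation small/imperceptible
    return (x + delta).detach()


# Toy usage on a random smooth network (Softplus keeps the saliency term
# differentiable; with ReLU its second derivative is zero almost everywhere).
model = torch.nn.Sequential(torch.nn.Linear(20, 32), torch.nn.Softplus(),
                            torch.nn.Linear(32, 5))
x = torch.randn(1, 20)
x_adv = joint_attack(model, x, true_label=0, target_label=3)
print(model(x).argmax().item(), "->", model(x_adv).argmax().item())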
