ML-LOO: Detecting Adversarial Examples with Feature Attribution

Deep neural networks achieve state-of-the-art performance on a wide range of tasks, yet they are easily fooled by small adversarial perturbations added to the input; on image data, these perturbations are often imperceptible to humans. We observe that the feature attributions of adversarially crafted examples differ significantly from those of the original examples. Based on this observation, we introduce a new framework for detecting adversarial examples by thresholding a scale estimate of the feature attribution scores. We further extend the method to multi-layer feature attributions in order to handle attacks with mixed confidence levels. In extensive experiments, our method outperforms state-of-the-art detection methods at distinguishing adversarial examples generated by popular attack methods on a variety of real data sets. In particular, it detects adversarial examples of mixed confidence levels and transfers across different attack methods.
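
The detection idea can be sketched concretely. The following is a minimal illustration, assuming a PyTorch image classifier: each feature is attributed by leave-one-out (the drop in the predicted-class probability when that feature is replaced by a baseline), a scale estimate of the resulting attribution map is computed, and inputs whose scale exceeds a threshold calibrated on clean data are flagged. The function names (loo_attribution, dispersion_score, is_adversarial), the zero baseline, and the inter-quartile range as the scale estimate are illustrative choices; the multi-layer extension and the threshold calibration itself are not shown.

    import numpy as np
    import torch

    def loo_attribution(model, x, baseline=0.0):
        """Leave-one-out attribution: change in the predicted-class probability
        when each input feature (pixel) is replaced by a baseline value."""
        model.eval()
        with torch.no_grad():
            probs = torch.softmax(model(x.unsqueeze(0)), dim=1)
            cls = probs.argmax(dim=1).item()
            p_full = probs[0, cls].item()

            flat = x.reshape(-1)
            attributions = torch.empty(flat.numel())
            for i in range(flat.numel()):
                x_loo = flat.clone()
                x_loo[i] = baseline
                p_i = torch.softmax(model(x_loo.view_as(x).unsqueeze(0)), dim=1)[0, cls].item()
                attributions[i] = p_full - p_i
        return attributions.numpy()

    def dispersion_score(attr):
        """Scale estimate of the attribution map, here the inter-quartile range."""
        q75, q25 = np.percentile(attr, [75, 25])
        return q75 - q25

    def is_adversarial(model, x, threshold):
        """Flag an input whose attribution dispersion exceeds a threshold chosen
        on clean validation data (e.g. for a fixed false-positive rate)."""
        return dispersion_score(loo_attribution(model, x)) > threshold

In practice the threshold would be tuned on held-out clean examples, and the multi-layer variant would compute the same dispersion statistic on attributions of intermediate-layer activations and combine them in a simple classifier.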
