Learning to Attack: Adversarial Transformation Networks

With the rapidly increasing popularity of deep neural networks for image recognition tasks, a parallel interest has arisen in generating adversarial examples to attack the trained models. To date, these approaches have involved either directly computing gradients with respect to the image pixels or directly solving an optimization on the image pixels. We generalize this pursuit in a novel direction: can a separate network be trained to efficiently attack another fully trained network? We demonstrate that it is possible, and that the generated attacks yield startling insights into the weaknesses of the target network. We call such a network an Adversarial Transformation Network (ATN). ATNs transform any input into an adversarial attack on the target network, while being minimally perturbing to the original inputs and the target network's outputs. Further, we show that ATNs are capable of not only causing the target network to make an error, but can be constructed to explicitly control the type of misclassification made. We demonstrate ATNs on both simple MNIST digit classifiers and state-of-the-art ImageNet classifiers deployed by Google, Inc.: Inception ResNet-v2.

With the resurgence of deep neural networks for many real-world classification tasks, there is an increased interest in methods to assess the weaknesses of the trained models. Adversarial examples are small perturbations of the inputs that are carefully crafted to fool the network into producing incorrect outputs. Seminal work by (Szegedy et al. 2013) and (Goodfellow, Shlens, and Szegedy 2014), as well as much recent work, has shown that adversarial examples are abundant and that there are many ways to discover them.

Given a classifier $f(x): x \in \mathcal{X} \to y \in \mathcal{Y}$ and original inputs $x \in \mathcal{X}$, the problem of generating untargeted adversarial examples can be expressed as the optimization $\operatorname{argmin}_{x^*} L(x, x^*)$ s.t. $f(x^*) \neq f(x)$, where $L(\cdot)$ is a distance metric between examples from the input space (e.g., the $L_2$ norm). Similarly, generating a targeted adversarial attack on a classifier can be expressed as $\operatorname{argmin}_{x^*} L(x, x^*)$ s.t. $f(x^*) = y_t$, where $y_t \in \mathcal{Y}$ is some target label chosen by the attacker.

Until now, these optimization problems have been solved using three broad approaches: (1) by directly using optimizers like L-BFGS or Adam (Kingma and Ba 2015), as proposed in (Szegedy et al. 2013) and (Carlini and Wagner 2016); (2) by approximation with single-step gradient-based techniques such as fast gradient sign (Goodfellow, Shlens, and Szegedy 2014) or fast least-likely-class (Kurakin, Goodfellow, and Bengio 2016); (3) by approximation with iterative variants of gradient-based techniques (Kurakin, Goodfellow, and Bengio 2016; Moosavi-Dezfooli et al. 2016; Moosavi-Dezfooli, Fawzi, and Frossard 2016). The iterative approaches use multiple forward and backward passes through the target network to more carefully move an input towards an adversarial classification. Other approaches assume a black-box threat model, with access only to the target model's outputs (Papernot et al. 2016; Baluja, Covell, and Sukthankar 2015; Tramèr et al. 2016); see (Papernot et al. 2015) for a discussion of threat models. Each of the above approaches solves an optimization problem such that a single set of inputs is perturbed just enough to force the target network to make a mistake.
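As a concrete illustration of approach (2), the sketch below implements the single-step fast gradient sign method against a generic classifier. It is a minimal sketch, not the paper's method: the use of PyTorch, the function name `fgsm_attack`, and the step size `epsilon` are illustrative assumptions.

```python
# Minimal sketch of the fast gradient sign method (Goodfellow, Shlens, and
# Szegedy 2014): one forward and one backward pass per input, perturbing each
# pixel by epsilon in the direction that increases the classification loss.
# The classifier `f`, step size `epsilon`, and function name are assumptions
# made for illustration only.
import torch
import torch.nn.functional as F

def fgsm_attack(f, x, y, epsilon=0.03):
    """Perturb a batch `x` (with true labels `y`) against classifier `f`."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(f(x_adv), y)   # loss w.r.t. the true labels
    loss.backward()                       # one backward pass through f
    # Step in the sign of the gradient, then clip back to valid pixel range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

Note how this (and the per-sample optimization variants) must be run anew for every input; the approach below instead amortizes the attack into a single trained network.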
We take a fundamentally different approach: given a well-trained target network, can we create a separate attack network that, with high probability, minimally transforms all inputs into ones that will be misclassified? No per-sample optimization problems should be solved. The attack network should take as input a clean image and output a minimally modified image that will cause a misclassification in the target network. Further, can we do this while imposing strict constraints on the types and amount of perturbation allowed? We introduce a class of networks, called Adversarial Transformation Networks, to efficiently address this task.

Adversarial Transformation Networks

In this work, we propose Adversarial Transformation Networks (ATNs). An ATN is a neural network that transforms an input into an adversarial example against a target network or set of networks. ATNs may be untargeted or targeted, and trained in a black-box or white-box manner. In this work, we focus on targeted, white-box ATNs. Formally, an ATN can be defined as a neural network

$g_{f,\theta}(x): x \in \mathcal{X} \to x'$   (1)

where $\theta$ is the parameter vector of $g$, $f$ is the target network, which outputs a probability distribution across class labels, and $x' \approx x$ but $\operatorname{argmax} f(x) \neq \operatorname{argmax} f(x')$.
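To make the definition in Eq. (1) concrete, the sketch below trains a small targeted ATN $g$ against a frozen target classifier $f$ by jointly minimizing an input-space distance between $x$ and $x'$ and a classification loss that pushes $f(x')$ toward an attacker-chosen class. This is a sketch under stated assumptions, not the paper's implementation: the `SimpleATN` architecture, the residual formulation, the loss weighting `beta`, and all names are illustrative.

```python
# Sketch of training a targeted, white-box ATN g against a frozen target
# network f, following Eq. (1): x' = g(x) should stay close to x while
# argmax f(x') becomes an attacker-chosen class. Architecture, loss weighting
# `beta`, and names are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleATN(nn.Module):
    """A tiny fully connected ATN for flattened MNIST-sized inputs."""
    def __init__(self, dim=784, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim), nn.Tanh(),  # bounded residual perturbation
        )

    def forward(self, x):
        # x' = x + small perturbation, kept inside the valid pixel range.
        return (x + 0.1 * self.net(x)).clamp(0.0, 1.0)

def train_step(g, f, x, target_class, opt, beta=0.1):
    """One ATN update: stay close to x while steering f toward target_class.
    `opt` should optimize only g's parameters; f stays fixed."""
    x_adv = g(x)
    y_t = torch.full((x.size(0),), target_class,
                     dtype=torch.long, device=x.device)
    loss_x = F.mse_loss(x_adv, x)            # input-space distance (stay close)
    loss_y = F.cross_entropy(f(x_adv), y_t)  # targeted misclassification term
    loss = beta * loss_x + loss_y
    opt.zero_grad()
    loss.backward()   # gradients flow through the fixed f back into g
    opt.step()
    return loss.item()
```

Once trained, generating an adversarial example costs only a single forward pass through $g$, in contrast to the per-sample optimization and iterative gradient methods discussed above.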

References

[1] Yanpei Liu et al. Delving into Transferable Adversarial Examples and Black-box Attacks. ICLR, 2016.
[2] Alexander Mordvintsev et al. Inceptionism: Going Deeper into Neural Networks. 2015.
[3] Jia Deng et al. ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
[4] Justin Johnson et al. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV, 2016.
[5] Seyed-Mohsen Moosavi-Dezfooli et al. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. CVPR, 2016.
[6] Seyed-Mohsen Moosavi-Dezfooli et al. Universal Adversarial Perturbations. CVPR, 2017.
[7] Nicolas Papernot et al. The Limitations of Deep Learning in Adversarial Settings. IEEE EuroS&P, 2016.
[8] Christian Szegedy et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. AAAI, 2017.
[9] Yann LeCun et al. The MNIST Database of Handwritten Digits. 2005.
[10] Alexey Kurakin et al. Adversarial Examples in the Physical World. ICLR, 2016.
[11] Anh Nguyen et al. Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images. CVPR, 2015.
[12] Nicholas Carlini and David A. Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE S&P, 2017.
[13] Christian Szegedy et al. Intriguing Properties of Neural Networks. ICLR, 2013.
[14] Kaiming He et al. Deep Residual Learning for Image Recognition. CVPR, 2016.
[15] Ian Goodfellow et al. Explaining and Harnessing Adversarial Examples. ICLR, 2014.
[16] Florian Tramèr et al. Stealing Machine Learning Models via Prediction APIs. USENIX Security Symposium, 2016.
[17] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992.
[18] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ICLR, 2014.
[19] Nicolas Papernot et al. Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. arXiv preprint, 2016.
[20] Jan Hendrik Metzen et al. On Detecting Adversarial Perturbations. ICLR, 2017.
[21] Shumeet Baluja et al. The Virtues of Peer Pressure: A Simple Method for Discovering High-Value Mistakes. CAIP, 2015.