Black-box attacks on deep neural networks via gradient estimation

Arjun Nitin Bhagoji, Warren He, Bo Li, Dawn Song

Research output: Contribution to conference, Paper, peer-review

Abstract

In this paper, we propose novel Gradient Estimation black-box attacks that generate adversarial examples using only query access to the target model's class probabilities and that do not rely on transferability. We also propose strategies to decouple the number of queries required to generate each adversarial example from the dimensionality of the input. An iterative variant of our attack achieves close to 100% attack success rates for both targeted and untargeted attacks on DNNs. We show that the proposed Gradient Estimation attacks outperform all other black-box attacks we tested on both the MNIST and CIFAR-10 datasets, achieving attack success rates similar to those of well-known, state-of-the-art white-box attacks. We also apply the Gradient Estimation attacks successfully against a real-world content moderation classifier hosted by Clarifai.
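The abstract does not spell out the estimator, so the following is only a minimal sketch under stated assumptions. It assumes three things. First, the gradient of a cross-entropy loss is estimated from queried class probabilities with two-sided finite differences. Second, query count is decoupled from input dimension by randomly grouping input features, so each group costs two queries instead of each coordinate. Third, the iterative variant takes signed steps that are projected onto an L-infinity ball around the original input. The names `estimate_gradient`, `iterative_fgs_attack` and `CountingOracle`, the toy linear softmax target, and all parameter values are hypothetical illustrations. They are not the authors' code or any Clarifai API.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

# Assumed black-box interface (hypothetical): the attacker can only submit an
# input and read back the model's class-probability vector.
Oracle = Callable[[Sequence[float]], List[float]]


def xent_loss(probs: Sequence[float], label: int) -> float:
    """Cross-entropy loss computed purely from queried probabilities."""
    return -math.log(max(probs[label], 1e-12))


def estimate_gradient(oracle: Oracle, x: Sequence[float], label: int,
                      delta: float = 1e-4,
                      groups: Optional[List[List[int]]] = None) -> List[float]:
    """Two-sided finite-difference estimate of the loss gradient at x.

    groups=None perturbs one coordinate at a time (2 * d queries).
    With groups, all coordinates in a group are perturbed together
    (2 * len(groups) queries) and each receives the group's
    directional-derivative estimate (roughly the sum of its partials).
    """
    if groups is None:
        groups = [[i] for i in range(len(x))]
    grad = [0.0] * len(x)
    for group in groups:
        x_plus, x_minus = list(x), list(x)
        for i in group:
            x_plus[i] += delta
            x_minus[i] -= delta
        diff = (xent_loss(oracle(x_plus), label)
                - xent_loss(oracle(x_minus), label)) / (2 * delta)
        for i in group:
            grad[i] = diff
    return grad


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def iterative_fgs_attack(oracle: Oracle, x: Sequence[float], label: int,
                         eps: float, alpha: float, steps: int,
                         group_size: Optional[int] = None,
                         seed: int = 0) -> List[float]:
    """Untargeted iterative signed-gradient attack driven by estimated gradients."""
    rng = random.Random(seed)
    x_adv = list(x)
    for _ in range(steps):
        groups = None
        if group_size is not None:
            # Assumed query-reduction strategy: a fresh random partition of
            # the features into groups of size group_size at every step.
            idx = list(range(len(x)))
            rng.shuffle(idx)
            groups = [idx[k:k + group_size] for k in range(0, len(idx), group_size)]
        g = estimate_gradient(oracle, x_adv, label, groups=groups)
        for i in range(len(x_adv)):
            stepped = x_adv[i] + alpha * sign(g[i])
            # Project onto the L-infinity ball around x and the [0, 1] input range.
            lo, hi = max(x[i] - eps, 0.0), min(x[i] + eps, 1.0)
            x_adv[i] = min(max(stepped, lo), hi)
    return x_adv


class CountingOracle:
    """Hypothetical toy target: a linear softmax model that counts its queries."""

    def __init__(self, weights: List[List[float]]):
        self.weights = weights
        self.queries = 0

    def __call__(self, x: Sequence[float]) -> List[float]:
        self.queries += 1
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


if __name__ == "__main__":
    target = CountingOracle([[1.0, -1.0, 0.5, 0.0], [-1.0, 1.0, -0.5, 0.2]])
    x = [0.6, 0.4, 0.5, 0.5]
    x_adv = iterative_fgs_attack(target, x, label=0, eps=0.3, alpha=0.1, steps=3)
    print("clean:", target(x), "adversarial:", target(x_adv))
```

With per-coordinate estimation, each iteration costs 2 * d queries. Passing `group_size=k` reduces this to 2 * ceil(d / k) queries, at the price of a coarser gradient estimate. This sketch illustrates why grouping can make the query budget less dependent on input dimensionality. It makes no claim about how the authors' strategies compare in practice.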

Original language: English (US)
State: Published - 2018
Externally published: Yes
Event: 6th International Conference on Learning Representations, ICLR 2018 - Vancouver, Canada
Duration: Apr 30, 2018 to May 3, 2018

Conference

Conference: 6th International Conference on Learning Representations, ICLR 2018
Country/Territory: Canada
City: Vancouver
Period: 4/30/18 to 5/3/18

ASJC Scopus subject areas

  • Education
  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics
