ABSTRACT Title of Dissertation: ANALYSIS OF DATA SECURITY VULNERABILITIES IN DEEP LEARNING Liam Fowl Doctor of Philosophy, 2022 Dissertation Directed by: Professor Wojciech Czaja Department of Mathematics Professor Thomas Goldstein Department of Computer Science As deep learning systems become more integrated into important application areas, the security of such systems becomes a paramount concern. Specifically, as modern net- works require an increasing amount of data on which to train, the security of data that is collected for these models cannot be guaranteed. In this work, we investigate several security vulnerabilities and security applications of the data pipeline for deep learning systems. We systematically evaluate the risks and mechanisms of data security from mul- tiple perspectives, ranging from users to large companies and third parties, and reveal several security mechanisms and vulnerabilities that are of interest to machine learning practitioners. ANALYSIS OF DATA SECURITY VULNERABILITIES IN DEEP LEARNING by Liam Fowl Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2022 Advisory Committee: Professor Wojciech Czaja, Chair/Advisor Professor Thomas Goldstein, Co-Advisor Professor Vincent Lyzinski Professor John Dickerson Professor William Gasarch, Dean’s Representative Acknowledgments There are so many people who have made it possible for me to complete my PhD. Academically, I owe much of my progress to my advisors: Wojtek Czaja and Tom Gold- stein. Wojtek made me feel welcome in the department in my early years, and introduced me to machine learning. Furthermore, he has been an invaluable resource for projects, as well as university-related issues. Likewise, Tom welcomed me into his group and has been incredibly helpful with project direction. His insight and feedback about deep learn- ing research has allowed me to publish several projects. I could not havae navigated the field of deep learning research without the help of both Wojtek and Tom. Also key to my research has been collaborators. They have provided invaluable help on projects, as well as camraderie and advice. These collaborators include Jonas Geiping, Micah Goldblum, Zeyad Emam, Avi Schwarzschild, Ping-yeh Chiang, Ronny Huang, Gowthami Somepalli, Arpit Bansal, Chen Zhu, Yuxin Wen, Steven Reich, Renkun Ni, Valeria Cherapanova, Eitan Borgnia, Amin Ghiasi, Gavin Taylor, Michael Moller, Ali Shafahi, Nathaniel Monson, and many others. Outside of academia, my friends have made my years in grad school incredibly enjoyable and I cannot thank them enough for the support that they provide. Finally, I would like to thank my family who have given me encouragement and support throughout my years at UMD. My girlfriend, Carolina, has been incredibly pa- ii tient and compassionate with me - especially in the last months of my program. Addi- tionally, I’ve cherished the support from my brother, Brendan, and sister-in-law, Maddie. My uncles and aunts, as well as my grandparents also deserve my thanks and acknow- ledgement for their support throughout this process. Last, but not least, I owe so much to my parents who made college possible for me, and have shown me nothing but love and support throughout all my years of schooling. 
Introduction

Deep learning has rapidly become the state-of-the-art approach for both image processing tasks [137] as well as natural language processing tasks [162]. This has led to significant advances in applications such as object detection [133], medical imaging [134], and several other important tasks [50, 93, 115]. This state-of-the-art performance has been fueled both by the increased capacity of modern networks and by the increased availability of large amounts of data on which to train them. High-performing vision models are often trained on millions of images, and large language models can be trained on hundreds of millions of tokens [32, 33]. These data-hungry models are fed data that is often not curated. The security of this data is often not guaranteed, and security risks to practitioners arise as a result. The data-collection pipeline can be roughly described by the interaction between three parties: companies, users, and third parties. Each party has different goals with regard to data collection and security, and each interacts with the other two in unique ways. In this work, we study different aspects of this pipeline.

In Chapter 1, we study the security of data collected from third parties by companies. Here we study the threat of targeted data poisoning attacks wherein a third party disseminates maliciously manipulated data that is then scraped and used in training by a practitioner. Ours is the first work to show that modern deep networks trained at an industrial scale are vulnerable to this type of attack; previous attacks worked only in toy settings on simple datasets. Furthermore, we achieve state-of-the-art results on previous benchmarks, all while showing significant computational advantages of our method - gradient alignment.

In Chapter 2, we study the interaction between companies/practitioners and third parties from the opposite perspective. We begin by motivating the concept of secure dataset release as a way for companies, like social media companies, to release user data while also preventing third parties from scraping said data and training a high-utility model on it. We formally phrase this problem as an availability poisoning attack, and we show that adversarial examples, originally intended as test-time attacks on neural networks, make highly effective availability poisons. We show that such poisons can often degrade the accuracy of a network trained on this data to below random accuracy levels. We analyze the mechanism by which this attack works and compare to several existing poisoning methods, finding that our method achieves state-of-the-art results on several datasets.

In Chapter 3, we take the perspective of users and evaluate the security of federated learning systems implemented by companies and third parties. We find that secure aggregation, which was previously thought to be a sufficient mechanism to ensure user privacy in practice, can be circumvented by malicious servers who wish to breach user privacy. We accomplish this task by introducing malicious model and parameter modifications to the federated learning pipeline that create structured gradient entries in the shared model updates. This can reveal verbatim copies of user data to the server. We show that our method outperforms previous "honest-but-curious" attacks by orders of magnitude. Our work suggests that further privacy techniques are needed to ensure user data is secure.
In Chapter 4, we investigate the safety of federated learning for text data. Text data is perhaps the most important application area for federated learning, and one where we know federated learning is actually practiced. We introduce a new threat model of malicious parameters, where a server is allowed to send "snapshots" of malicious parameters but is not allowed to modify the underlying architecture. We show that Transformer-based language models are especially vulnerable to privacy attacks due to the large linear layers included in Transformer blocks. We show that recovered embeddings can be matched to known positions and tokens, and we introduce an attention mechanism modification that introduces sequence coding into embeddings. This allows a malicious server to recover a large amount of data from multiple users - a setting that had previously not been investigated. We vastly outperform the state-of-the-art for "honest-but-curious" attacks on federated language models and demonstrate the threat posed by such an attack on multiple commonly used architectures.

We believe that revealing such threats to the data-security pipeline is important for both practitioners and users, and subsequent investigation, including investigation into defenses, should be performed.

Table of Contents

Acknowledgements
Table of Contents
Introduction
Chapter 0: Preliminaries
0.1 Neural Networks
0.1.1 Convolutional Networks
0.1.2 Transformer Models
0.2 Training Neural Networks
0.3 Datasets
0.4 Attacks on Deep Learning Systems
0.4.1 Adversarial Examples
0.4.2 Data Poisoning
0.5 Federated Learning
Chapter 1: Integrity Poisoning Attacks
1.1 Introduction
1.2 Related Work
1.3 Efficient Poison Brewing
1.3.1 Threat Model
1.3.2 Motivation
1.3.3 The Central Mechanism: Gradient Alignment
1.3.4 Making attacks that transfer and succeed "in the wild"
1.4 Theoretical Analysis
1.5 Experimental Evaluation
1.5.1 Evaluations on CIFAR-10
1.5.2 Poisoning ImageNet models
1.5.3 Deficiencies of Defense Strategies
1.5.4 Deficiencies of Filtering Defenses
1.5.5 Details: Defense by Differential Privacy
1.5.6 Full-scale MetaPoison Comparisons on CIFAR-10
1.5.7 Details: Gradient Alignment Visualization
1.5.8 Ablation Studies - Reduced Brewing/Victim Training Data
1.5.9 Ablation Studies - Method
1.5.10 Transfer Experiments
1.5.11 Multi-Target Experiments
1.6 Visualizations
1.7 Experimental Setup
1.7.1 Cloud AutoML Setup
1.7.2 Hardware
1.7.3 Models
1.8 Remarks
1.9 Conclusion
Chapter 2: Availability Poisoning Attacks
2.1 Introduction
2.2 Related Work
2.3 Adversarial Examples as Poisons
2.3.1 Threat Model and Motivation
2.3.2 Problem Setup
2.3.3 Technical Details for Successful Attacks
2.3.4 Baseline CIFAR-10 Results
2.3.5 Large Scale Poisoning
2.3.6 Facial Recognition
2.3.7 Less Data
2.3.8 Defenses
2.4 Additional Experiments and Details
2.4.1 Training/Crafting details
2.4.2 Visualization
2.4.3 Adversary comparison
2.4.4 Crafting Ablations
2.4.5 Network Variation
2.4.6 Relabeling Trick
2.4.7 Learning Without Seeing
2.4.8 ImageNet Comparison
2.4.9 Instability of Untargeted Attacks
2.5 Analysis
2.6 Conclusion
Chapter 3: Malicious Model Modifications for Federated Learning
3.1 Introduction
3.2 Limitations of Existing Attack Strategies
3.3 Model Modifications
3.3.1 Threat Model
3.3.2 A Simple Example
3.3.3 Imprinting User Information into Model Updates
3.4 Experiments
3.4.1 Full batch recovery
3.4.2 Privacy breaches in industrial-sized batches – One-shot Attacks
3.4.3 Variants
3.4.4 Other Choices of Linear Functions and Distributions
3.4.5 Comparison to Honest Servers and Optimization-based Attacks
3.4.6 Technical Details
3.4.7 Additional Images
3.4.8 Defense Discussion
3.5 Potential Defense and Mitigation Strategies
3.6 Conclusions
Chapter 4: Attacking Federated Learning for Text
4.1 Introduction
4.2 Motivation and Threat Model
4.3 Method - How to Program a Corrupted Transformer
4.3.1 Getting Tokens
4.3.2 Getting Positions
4.3.3 Getting Sequences
4.3.4 Putting It All Together
4.4 Empirical Evaluation of the Attack
4.5 Mitigations
4.6 Technical Details
4.7 Algorithm Details
4.8 Measurement Transferability
4.9 Variants and Details
4.9.1 Masked Language Modelling
4.9.2 GeLU
4.9.3 Dropout
4.10 Additional Background Material
4.11 Further Results
4.12 Conclusions
Bibliography

Chapter 0: Preliminaries

0.1 Neural Networks

Historically, neural networks (sometimes referred to as artificial neural networks) arose in an attempt to understand the behavior of the brain [101]. The first implementation of a neural network, called a perceptron, was developed by Frank Rosenblatt in 1958 [135]. While research on neural networks continued after the introduction of the perceptron, growth in the field of machine learning accelerated with the increased computational capacity that emerged in the 21st century. This increased capacity led to the advent of deep learning, which loosely comprises the area of research surrounding training, optimizing, and deploying modern networks of increased depth and capacity [54]. Nowadays, there exist several flavors of neural networks. One of the most common types is the feed-forward neural network.

Broadly speaking, feed-forward neural networks are compositions of affine functions with non-linearities. For example, a three-layer neural network could look like $f^{(3)}(f^{(2)}(f^{(1)}(x)))$, where $f^{(i)}(x) = g(W^{(i)}x + b^{(i)})$ and $g$ is a non-linearity such as ReLU ($\mathrm{ReLU}(x) = \max(0, x)$) that operates pointwise on the output of the affine function parameterized by weights $W^{(i)}$ and biases $b^{(i)}$ [54]. The aim of such a composition is to approximate some function $f^*$. For example, $f^*$ could be a classifier of images.
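As a concrete illustration of the composition above, the following is a minimal PyTorch sketch of a three-layer feed-forward classifier with ReLU non-linearities. The choice of PyTorch, the layer widths, and the input/output sizes are illustrative assumptions only, not part of the original text.

import torch
import torch.nn as nn

# Three affine layers f^(1), f^(2), f^(3), each followed by the pointwise
# non-linearity g = ReLU (no non-linearity after the final layer, whose
# outputs are interpreted as class scores).
model = nn.Sequential(
    nn.Linear(3 * 32 * 32, 512),  # W^(1) x + b^(1)
    nn.ReLU(),
    nn.Linear(512, 512),          # W^(2) x + b^(2)
    nn.ReLU(),
    nn.Linear(512, 10),           # W^(3) x + b^(3), ten output classes
)

x = torch.randn(8, 3 * 32 * 32)   # a batch of 8 flattened 32x32 color images
scores = model(x)                 # shape (8, 10)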
Figure 0.1: A feed-forward MLP with one hidden layer of two neurons trained to represent the "XOR" function [54].

Feed-forward networks where the weights $W^{(i)}$ are general linear maps are often called multi-layer perceptrons, or MLPs [54].

0.1.1 Convolutional Networks

There exist several variants of feed-forward networks. One of the most common variants is the convolutional feed-forward network [90]. A convolutional feed-forward network is a network wherein the linear maps determined by $W^{(i)}$ represent convolutions with some kernel rather than general linear maps. Convolution of a 2-d input $I$ with a 2-d kernel $K$ is defined as [54]:
\[
(K * I)(i, j) = \sum_{k} \sum_{l} K(k, l)\, I(i + k, j + l). \tag{0.1}
\]
The linear maps resulting from convolutions take the form of doubly block circulant matrices [54]. Convolutional filters enforce translationally invariant feature extraction, and are the backbone of many of the most successful classification and detection networks [59].

0.1.2 Transformer Models

Recently, a new type of feed-forward network has emerged as a powerful tool for both image classification and language processing tasks. The Transformer model combines standard feed-forward layers with attention mechanisms to achieve competitive results in fields traditionally dominated by recurrent neural networks [162] or convolutional networks [33]. The attention mechanism allows for efficient and parallel training of Transformer models compared with recurrent neural networks. A diagram of the Transformer architecture can be found in Figure 0.2.

Figure 0.2: A Transformer model (diagram taken from [49, 162]).

0.2 Training Neural Networks

Practitioners aim to find network parameters that minimize some loss $L$ which evaluates the output of the network on some distribution $\mathcal{D}$ of interest. For example, the loss for a model $f$ with parameters $\theta$ can be expressed as:
\[
L^*(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\, L(f(x;\theta), y).
\]
However, when it comes to actually optimizing the parameters $\theta$, this problem is usually converted to an empirical risk minimization problem wherein the practitioner minimizes
\[
\sum_{(x,y)\in D_{\mathrm{train}}} L(f(x;\theta), y).
\]
Common loss functions include the cross-entropy loss [54] for classification tasks, defined as:
\[
L(y^*, y) = -\sum_i y^*_i \log(y_i),
\]
for ground-truth label $y^*$ and network output $y$ (fed through a softmax function to produce probabilities). Both $y^*$ and $y$ are discrete distributions over the label space. Another common loss function is the regularized $\ell_2$ loss for regression tasks, defined as:
\[
L(y^*, y) = \sum_i (y_i - y^*_i)^2 + \alpha\|\theta\|_2^2.
\]
Another piece of the training puzzle is the choice of optimization algorithm. One of the most common choices is the stochastic gradient descent (SGD) algorithm [54]. In this algorithm, the parameters $\theta$ are updated using the gradient direction from a stochastic minibatch as follows:
\[
\theta_{k+1} = \theta_k - \eta \nabla_\theta \sum_{i=1}^{m} L(f(x^{(i)}; \theta_k), y^{(i)}),
\]
for some minibatch $\{x^{(1)}, \ldots, x^{(m)}\}$ and some learning rate $\eta$ [54]. Other algorithms, like SGD with momentum and Adam [80], have been shown to be successful in certain settings.

0.3 Datasets

Datasets are another fundamental component of neural network training. In vision, there exist several datasets of varying complexity and size. One common dataset is CIFAR-10 [86], which consists of color images of size 32x32 coming from 10 classes. Usually, the dataset is split into 50,000 training examples, 5,000 from each class, and 10,000 testing images.
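For reference, this standard train/test split can be obtained, for instance, through the torchvision package; the snippet below is only an illustrative sketch and assumes torchvision and its default download mirror are available.

import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # map 32x32 color images to tensors in [0, 1]
# 50,000 training images (5,000 per class) and 10,000 test images
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
print(len(train_set), len(test_set))  # 50000 10000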
A more complex and much larger dataset commonly used for vision tasks is the ILSVRC2012 (ImageNet) dataset [137]. This dataset consists of over 1,000,000 color images from 1,000 classes. The images are of varying size, but a common preprocessing step is to crop the images to size 224x224. Still other datasets exist for different tasks like detection, segmentation, etc. For natural language processing tasks, one common dataset is the WikiText dataset, which consists of over 100,000,000 tokens from curated Wikipedia articles [108]. We also utilize the Shakespeare dataset [77], consisting of 40,000 lines from different Shakespeare plays.

0.4 Attacks on Deep Learning Systems

Attacks on deep learning systems are an important area of research as such systems move toward deployment in security-critical applications. One avenue of attack is data manipulation. In this vein, attacks can generally be classified as train-time attacks or inference-time (or test-time) attacks.

0.4.1 Adversarial Examples

Adversarial examples are inference-time attacks against neural networks [55]. These adversarial examples are minimally perturbed inputs that cause an already trained network to mis-classify the modified examples. For example, an image of a dog could be minimally perturbed, maintaining the semantic label dog to human observers, while being classified as a cat by a network. The most common method for producing adversarial examples is projected gradient descent (PGD), which iteratively takes steps that maximize the loss with respect to the input pixels and projects onto an $\ell_p$ ball [99].

0.4.2 Data Poisoning

In contrast to adversarial examples, data poisoning attacks are train-time attacks. These attacks involve maliciously modifying data on which an unwitting practitioner trains a neural network. This is generally considered to be a more difficult problem than crafting adversarial examples for a network, since the training procedure, and thus the final parameters of a network, are not known to the poisoner. Several attacks, and several attacker goals, exist within the data poisoning literature, including targeted data poisoning attacks, which aim to cause mis-classification of a pre-selected target datapoint, and availability attacks, which aim to degrade the overall performance of a victim network on some distribution [6].

0.5 Federated Learning

Federated learning (FL) is a system for training networks in a distributed fashion. Generally speaking, in standard training, a company could collect user data $\{(x_i, y_i)\}_{i=1}^N$ from users and simply optimize
\[
\arg\min_{\theta} \sum_i L(f(x_i; \theta), y_i).
\]
However, in FL, the company, or central server, only receives user updates [14]. Formally, the server receives updates $\{\theta^*_i\}_{i=1}^N$ from users, and then computes the new parameters
\[
\theta_{k+1} = \theta_k - \eta \sum_{i=1}^{N} \alpha_i \theta^*_i.
\]
The main advantage of this setup is user privacy, as the user data never gets sent to the central server.

Chapter 1: Integrity Poisoning Attacks

Data poisoning attacks modify training data to maliciously control a model trained on such data. Previous attacks against deep neural networks have been limited in scope and success, working only in simplified settings or being prohibitively expensive for large datasets. In this work, we focus on a particularly malicious poisoning attack that is both "from scratch" and "clean label", meaning we analyze an attack that successfully works against new, randomly initialized models, and is nearly imperceptible to humans, all while perturbing only a small fraction of the training data.
The central mechanism of this attack is matching the gradient direction of malicious examples. We analyze why this works, supplement it with practical considerations, and show its threat to real-world practitioners, finding that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset. Finally, we demonstrate the limitations of existing defensive strategies against such an attack, concluding that data poisoning is a credible threat, even for large-scale deep learning systems.

This work was performed together with Jonas Geiping, Ronny Huang, Wojtek Czaja, Gavin Taylor, and Tom Goldstein. My contributions include: jointly conceiving of the mechanism of gradient alignment, a substantial amount of the large-scale experiments, jointly formulating the theoretical result, and writing a substantial portion of the paper.

1.1 Introduction

Machine learning models have quickly become the backbone of many applications, from photo processing on mobile devices and ad placement to security and surveillance [91]. These applications often rely on large training datasets that aggregate samples of unknown origins, and the security implications of this are not yet fully understood [119]. Data is often sourced in a way that lets malicious outsiders contribute to the dataset, such as scraping images from the web, farming data from website users, or using large academic datasets scraped from social media [157]. Data poisoning is a security threat in which an attacker makes imperceptible changes to data that can then be disseminated through social media, user devices, or public datasets without being caught by human supervision. The goal of a poisoning attack is to modify the final model in order to achieve a malicious outcome. Poisoning research has focused on attacks that achieve mis-classification of some predetermined target data, as in Shafahi et al. [141] and Suciu et al. [151], i.e. implementing a backdoor - but other potential goals of the attacker include denial-of-service [143, 149], concealment of users [142], and introduction of fingerprint information [97]. These attacks are applied in scenarios such as social recommendation [63], content management [37, 92], algorithmic fairness [146] and biometric recognition [96]. Accordingly, industry practitioners ranked data poisoning as the most serious attack on ML systems in a recent survey of corporations [88].

In this work we show that efficient poisoned data can be created even in the setting of deep neural networks trained on large image classification tasks, such as ImageNet [137].

Figure 1.1: The poisoning pipeline. Poisoned images (labrador retriever class) are inserted into a dataset and cause a newly trained victim model to mis-classify a target (otter) image. We show successful poisons for a threat model where 0.1% of training data is changed within an $\ell_\infty$ bound of ε = 8. Further visualizations of poisoned data can be found in Section 1.6.

Previous work on data poisoning has often focused on either linear classification tasks [10, 84, 167] or poisoning of transfer learning and fine tuning [82, 141] rather than a full end-to-end training pipeline. Poison attacks on deep neural networks (and especially on ones trained from scratch) have proven difficult in Muñoz-González et al. [110] and Shafahi et al. [141].
Only recently were attacks against neural networks retrained from scratch shown to be possible in [68] for CIFAR-10 - however, with costs that render scaling to larger datasets, like ImageNet, prohibitively expensive. We formulate data poisoning as a gradient matching problem and analyze the resulting novel attack algorithm, which scales to unprecedented dataset size and effectiveness. Crucially, the new poisoning objective is orders of magnitude more efficient than a previous formulation based on meta-learning [68] and succeeds more often. We conduct an experimental evaluation, showing that poisoned datasets created by this method are robust and significantly outperform other attacks on CIFAR-10. We then demonstrate reliably successful attacks on common ImageNet models in realistic training scenarios. For example, the attack successfully compromises a ResNet-34 by manipulating only 0.1% of the data points with perturbations less than 8 pixel values in $\ell_\infty$-norm. We close by discussing previous defense strategies and how strong differential privacy [1] is the only existing defense that can partially mitigate the effects of the attack.

1.2 Related Work

The task of data poisoning is closely related to the problem of adversarial attacks at test time, also referred to as evasion attacks [100, 155], where the attacker alters a target test image to fool an already-trained model. This attack is applicable in scenarios where the attacker has control over the target image, but not over the training data. An intermediary between data poisoning and adversarial attacks is the backdoor trigger attack [138, 160]. These attacks involve inserting a trigger – often an image patch – into training data, which is later activated by also applying the trigger to test images. Backdoor attacks require perturbations to both training and test-time data – a more permissive threat model than either poisoning or evasion.

In contrast to evasion and backdoor attacks, data poisoning attacks consider a setting where the attacker can modify training data, but does not have access to test data. Within this setting we focus on targeted attacks – attacks that aim to cause a specific target test image (or set of target test images) to be mis-classified. For example, an attack may cause a certain target image of an otter to be classified as a dog by victim models at test time. This attack is difficult to detect, because it does not noticeably degrade either training or validation accuracy [68, 141].

Two basic schemes for targeted poisoning are label flipping [6, 124] and watermarking [141, 151]. In label flipping attacks, an attacker is allowed to change the label of examples, whereas in a watermarking attack, the attacker perturbs the training image, not the label, by superimposing a target image onto training images. These attacks can be successful, yet they are easily detected by supervision such as Papernot and McDaniel [120]. This is in contrast to clean-label attacks, which maintain the semantic labels of data.

Mathematically, data poisoning is a bilevel optimization problem [5, 10]; the attacker optimizes image pixels to enforce (malicious) criteria on the resulting network parameters, which are themselves the solution to an "inner" optimization problem that minimizes the training objective. Direct solutions to the bilevel problem have been proposed where feasible, for example, for SVMs in Biggio et al. [10] or logistic regression in Demontis et al. [29].
However, direct optimization of the poisoning objective is intractable for deep neural networks because it requires backpropagating through the entire SGD training procedure, see [110]. As such, the bilevel objective has to be approximated. Recently, MetaPoison [68] proposed to approximately solve the bilevel problem based on methods from the meta-learning community [41]. The bilevel gradient is approximated by backpropagation through several unrolled gradient descent steps. This is the first attack to succeed against deep networks on CIFAR-10, as well as providing transferability to other models. Yet, [68] uses a complex loss function averaged over a wide range of models trained to different epochs, and a single unrolling step necessarily involves both clean and poisoned data, making it roughly as costly as one epoch of standard training. With an ensemble of 24 models, [68] requires 3 (2 unrolling steps + 1 clean update step) x 2 (backpropagation through unrolled steps) x 60 (first-order optimization steps) x 24 (ensemble of models) equivalent epochs of normal training to attack, as well as $\sum_{k=0}^{22} k = 253$ epochs of pretraining. All in all, this equates to 8893 training epochs.

Heuristics for data poisoning of neural networks stand in contrast to such bilevel approaches. The most prominent heuristic is feature collision, as in Poison Frogs [141], which seeks to cause a target test image to be misclassified by perturbing training data to collide with the target image in feature space. Modifications surround the target image in feature space with a convex polytope [178] or collection of poisons [2]. These methods are efficient, but designed to attack fine-tuning scenarios where the feature extractor is nearly fixed. When applied to deep networks trained from scratch, their performance drops significantly.

1.3 Efficient Poison Brewing

In this section, we will discuss an intriguing weakness of neural network training based on first-order optimization and derive an attack against it. This attack modifies training images so that they produce a malicious gradient signal during training, even while appearing inconspicuous. This is done by matching the gradient of the target images within $\ell_\infty$ bounds. Because neural networks are trained by gradient descent, even minor modifications of the gradients can be incorporated into the final model.

This attack compounds the strengths of previous schemes, allowing for data poisoning as efficiently as in Poison Frogs [141], requiring only a single pretrained model and a time budget on the order of one epoch of training for optimization - but still capable of poisoning the from-scratch setting considered in [68]. This combination allows an attacker to "brew" poisons that successfully attack realistic models on ImageNet.

1.3.1 Threat Model

We define two parties, the attacker, which has limited control over the training data, and the victim, which trains a model based on this data. We first consider a gray-box setting, where the attacker has knowledge of the model architecture used by its victim. The attacker is permitted to poison a fraction of the training dataset (usually less than 1%) by changing images within an $\ell_\infty$-norm ε-bound (e.g. with ε ≤ 16). This constraint enforces clean-label attacks, meaning that the semantic label of a poisoned image is still unchanged.
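For concreteness, the $\ell_\infty$ constraint above amounts to clipping each poison perturbation elementwise. The following minimal PyTorch-style sketch shows this projection; the tensor names and the 0–255 pixel range are illustrative assumptions rather than the exact implementation used in this chapter.

import torch

def project_linf(poison_img, clean_img, eps=16.0):
    # Keep the perturbation inside the l-infinity ball of radius eps ...
    delta = torch.clamp(poison_img - clean_img, -eps, eps)
    # ... and keep the poisoned image inside the valid pixel range.
    return torch.clamp(clean_img + delta, 0.0, 255.0)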
The attacker has no knowledge of the training procedure - neither about the initialization of the victim's model, nor about the (randomized) mini-batching and data augmentation that is standard in the training of deep learning models. We formalize this threat model as a bilevel problem for a machine learning model $F(x, \theta)$ with inputs $x \in \mathbb{R}^n$ and parameters $\theta \in \mathbb{R}^p$ (implicitly a vector-valued function of the perturbations), and loss function $L$. We denote the $N$ training samples by $(x_i, y_i)_{i=1}^N$, from which a subset of $P$ samples are poisoned. For notational simplicity we assume the first $P$ training images are poisoned by adding a perturbation $\Delta_i$ to the $i$th training image. The perturbation is constrained to be smaller than $\varepsilon$ in the $\ell_\infty$-norm. The task is to optimize $\Delta$ so that a set of $T$ target samples $(x^t_i, y^t_i)_{i=1}^T$ is reclassified with the new adversarial labels $y^{\mathrm{adv}}_i$:
\[
\min_{\Delta \in \mathcal{C}} \sum_{i=1}^{T} L\big(F(x^t_i, \theta(\Delta)), y^{\mathrm{adv}}_i\big) \quad \text{s.t.} \quad \theta(\Delta) \in \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(F(x_i + \Delta_i, \theta), y_i). \tag{1.1}
\]
We subsume the constraints in the set $\mathcal{C} = \{\Delta \in \mathbb{R}^{N \times n} : \|\Delta\|_\infty \le \varepsilon,\ \Delta_i = 0\ \forall i > P\}$. We call the main objective on the left the adversarial loss, and the objective that appears in the constraint on the right the training loss. For the remainder, we consider a single target image ($T = 1$) as in Shafahi et al. [141], but stress that this is not a general limitation, as shown in the appendix.

1.3.2 Motivation

What is the optimal alteration of the training set that causes a victim neural network $F(x, \theta)$ to mis-classify a specific target image $x^t$? We know that the expressivity of deep networks allows them to fit arbitrary training data [172]. Thus, if an attacker were unconstrained, a straightforward way to cause targeted mis-classification of an image is to insert the target image, with the incorrect label $y^{\mathrm{adv}}$, into the victim network's training set. Then, when the victim minimizes the training loss, they simultaneously minimize the adversarial loss, based on the gradient information about the target image.

In our threat model, however, the attacker is not able to insert the mis-labeled target. They can, however, still mimic the gradient of the target by creating poisoned data whose training gradient correlates with the adversarial target gradient. If the attacker can enforce
\[
\nabla_\theta L(F(x^t, \theta), y^{\mathrm{adv}}) \approx \frac{1}{P} \sum_{i=1}^{P} \nabla_\theta L(F(x_i + \Delta_i, \theta), y_i) \tag{1.2}
\]
to hold for any $\theta$ encountered during training, then the victim's gradient steps that minimize the training loss on the poisoned data (right-hand side) will also minimize the attacker's adversarial loss on the targeted data (left side).

1.3.3 The Central Mechanism: Gradient Alignment

Gradient magnitudes vary dramatically across different stages of training, and so finding poisoned images that satisfy Equation 1.2 for all $\theta$ encountered during training is infeasible. Instead, we align the target and poison gradients in the same direction, that is, we minimize their negative cosine similarity. We do this by taking a clean model $F$ with parameters $\theta$, keeping $\theta$ fixed, and then optimizing
\[
B(\Delta, \theta) = 1 - \frac{\big\langle \nabla_\theta L(F(x^t, \theta), y^{\mathrm{adv}}),\ \sum_{i=1}^{P} \nabla_\theta L(F(x_i + \Delta_i, \theta), y_i) \big\rangle}{\big\|\nabla_\theta L(F(x^t, \theta), y^{\mathrm{adv}})\big\| \cdot \big\|\sum_{i=1}^{P} \nabla_\theta L(F(x_i + \Delta_i, \theta), y_i)\big\|}. \tag{1.3}
\]

Algorithm 1: Poison Brewing via the discussed approach.
1: Require: Pretrained clean network $F(\cdot, \theta)$, a training set of images and labels $(x_i, y_i)_{i=1}^N$, a target $(x^t, y^{\mathrm{adv}})$, poison budget $P < N$, perturbation bound $\varepsilon$, restarts $R$, optimization steps $M$
2: Begin
3: Select $P$ training images with label $y^{\mathrm{adv}}$
4: For $r = 1, \ldots, R$ restarts:
5: Randomly initialize perturbations $\Delta^r \in \mathcal{C}$
6: For $j = 1, \ldots, M$ optimization steps:
7: Apply data augmentation to all poisoned samples $(x_i + \Delta^r_i)_{i=1}^P$
8: Compute the average cost $B(\Delta^r, \theta)$, as in Equation 1.3, over all poisoned samples
9: Update $\Delta^r$ with a step of signed Adam and project onto $\|\Delta^r\|_\infty \le \varepsilon$
10: Choose the optimal $\Delta^*$ as the $\Delta^r$ with minimal value of $B(\Delta^r, \theta)$
11: Return: Poisoned dataset $(x_i + \Delta^*_i, y_i)_{i=1}^N$

We optimize $B(\Delta)$ using signed Adam updates with decaying step size, projecting onto $\mathcal{C}$ after every step. This produces an alignment between the averaged poison gradients and the target gradient. In contrast to Poison Frogs, all layers of the network are included (via their parameters) in this objective, not just the last feature layer.

Each optimization step of this attack requires only a single differentiation of the parameter gradient with respect to its input, instead of differentiating through several unrolled steps as in MetaPoison. Furthermore, as in Poison Frogs, we differentiate through a loss that only involves the (small) subset of poisoned data instead of involving the entire dataset, such that the attack is especially fast if the budget is small. Finally, the method is able to create poisons using only a single parameter vector $\theta$ (like Poison Frogs in the fine-tuning setting, but unlike MetaPoison) and does not require updates of this parameter vector after each poison optimization step.

1.3.4 Making attacks that transfer and succeed "in the wild"

A practical and robust attack must be able to poison different random initializations of network parameters and a variety of architectures. To this end, we employ several techniques:

Differentiable Data Augmentation and Resampling: Data augmentation is a standard tool in deep learning, and transferable image perturbations must survive this process. At each step minimizing Equation 1.3, we randomly draw a translation, crop, and possibly a horizontal flip for each poisoned image, then use bilinear interpolation to resample to the original resolution. When updating $\Delta$, we differentiate through this grid sampling operation as in Jaderberg et al. [70]. This creates an attack which is robust to data augmentation and leads to increased transferability.

Restarts: The efficiency gained in Section 1.3.3 allows us to incorporate restarts, a common technique in the creation of evasion attacks [109, 130]. We minimize Equation 1.3 several times from random starting perturbations, and select the set of poisons that gives us the lowest alignment loss $B(\Delta)$. This allows us to trade off reliability against computational effort.

Model Ensembles: A known approach to improving transferability is to attack an ensemble of model instances trained from different initializations [68, 95, 178]. However, ensembles are highly expensive, increasing the pre-training cost for only a modest, but stable, increase in performance.

We show the effects of these techniques via CIFAR-10 experiments (see Table 1.1 and Section 1.5.1). To keep the attack within practical reach, we do not consider ensembles for our experiments on ImageNet data, opting for the cheaper techniques of restarts and data augmentation. A summarizing description of the attack can be found in Algorithm 1. Lines 8 and 9 of Algorithm 1 are done in a stochastic (mini-batch) setting, which we omit in Algorithm 1 for notational simplicity.
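To make the inner loop of Algorithm 1 concrete, the following is a minimal PyTorch-style sketch of a single optimization step of the gradient-matching objective in Equation 1.3. It is a simplified illustration under stated assumptions, not the exact implementation: data augmentation, restarts, and mini-batching are omitted; the model, loss function, tensors, and an Adam optimizer over delta are assumed to be supplied by the caller; and the signed-Adam update is approximated by replacing the gradient of delta with its sign before the optimizer step.

import torch
import torch.nn.functional as F

def poison_step(model, loss_fn, x_target, y_adv, x_poison, y_poison, delta, opt, eps):
    # Gradient of the adversarial loss on the target image (treated as fixed).
    adv_grad = torch.autograd.grad(
        loss_fn(model(x_target), y_adv), tuple(model.parameters())
    )
    # Gradient of the training loss on the poisoned images; keep the graph so
    # the alignment cost can be differentiated with respect to delta.
    poison_grad = torch.autograd.grad(
        loss_fn(model(x_poison + delta), y_poison),
        tuple(model.parameters()),
        create_graph=True,
    )
    # Negative cosine similarity between the two flattened gradients (Eq. 1.3).
    a = torch.cat([g.flatten() for g in adv_grad])
    p = torch.cat([g.flatten() for g in poison_grad])
    cost = 1 - F.cosine_similarity(a, p, dim=0)
    opt.zero_grad()
    cost.backward()
    delta.grad = delta.grad.sign()   # signed update, as in Algorithm 1
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)      # project onto the l-infinity ball
    return cost.item()

In the full algorithm this step is wrapped in the restart loop, combined with differentiable data augmentation, and run for M steps with a decaying step size.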
1.4 Theoretical Analysis

Can gradient alignment cause network parameters to converge to a model with low adversarial loss? To simplify presentation, we denote the adversarial loss and normal training loss of Equation 1.1 as
\[
L_{\mathrm{adv}}(\theta) := L(F(x^t, \theta), y^{\mathrm{adv}}) \quad \text{and} \quad L(\theta) := \frac{1}{N}\sum_{i=1}^{N} L(x_i, y_i, \theta),
\]
respectively. Also, recall that $1 - B(\Delta, \theta^k)$, defined in Equation 1.3, measures the cosine similarity between the gradient of the adversarial loss and the gradient of the normal training loss. We adapt a classical result of Zoutendijk [116, Thm. 3.2] to shed light on why data poisoning can work even though the victim only performs standard training on a poisoned dataset:

Proposition 1.4.1 (Adversarial Descent). Let $L_{\mathrm{adv}}(\theta)$ be bounded below and have a Lipschitz continuous gradient with constant $L > 0$, and assume that the victim model is trained by gradient descent with step sizes $\alpha_k$, i.e. $\theta^{k+1} = \theta^k - \alpha_k \nabla L(\theta^k)$. If the gradient descent steps $\alpha_k > 0$ satisfy
\[
0 < \alpha_k L < \beta \big(1 - B(\Delta, \theta^k)\big) \frac{\|\nabla L_{\mathrm{adv}}(\theta^k)\|}{\|\nabla L(\theta^k)\|}, \tag{1.4}
\]
for some fixed $\beta < 1$, then $L_{\mathrm{adv}}(\theta^{k+1}) < L_{\mathrm{adv}}(\theta^k)$. If in addition there exist $\varepsilon > 0$ and $k_0$ such that $B(\Delta, \theta^k) < 1 - \varepsilon$ for all $k \ge k_0$, then
\[
\lim_{k\to\infty} \|\nabla L_{\mathrm{adv}}(\theta^k)\| = 0. \tag{1.5}
\]

Proof. Consider the gradient descent update $\theta^{k+1} = \theta^k - \alpha_k \nabla L(\theta^k)$. Firstly, due to Lipschitz smoothness of the gradient of the adversarial loss $L_{\mathrm{adv}}$, we can estimate the value at $\theta^{k+1}$ by the descent lemma:
\[
L_{\mathrm{adv}}(\theta^{k+1}) \le L_{\mathrm{adv}}(\theta^k) - \langle \alpha_k \nabla L_{\mathrm{adv}}(\theta^k), \nabla L(\theta^k)\rangle + \alpha_k^2 L \|\nabla L(\theta^k)\|^2.
\]
If we further use the cosine identity
\[
\langle \nabla L_{\mathrm{adv}}(\theta^k), \nabla L(\theta^k)\rangle = \|\nabla L(\theta^k)\|\,\|\nabla L_{\mathrm{adv}}(\theta^k)\| \cos(\gamma_k),
\]
denoting the angle between both vectors by $\gamma_k$, we find that
\[
L_{\mathrm{adv}}(\theta^{k+1}) \le L_{\mathrm{adv}}(\theta^k) - \alpha_k \|\nabla L(\theta^k)\|\,\|\nabla L_{\mathrm{adv}}(\theta^k)\| \cos(\gamma_k) + \alpha_k^2 L \|\nabla L(\theta^k)\|^2
= L_{\mathrm{adv}}(\theta^k) - \Big(\alpha_k \frac{\|\nabla L_{\mathrm{adv}}(\theta^k)\|}{\|\nabla L(\theta^k)\|} \cos(\gamma_k) - \alpha_k^2 L\Big) \|\nabla L(\theta^k)\|^2.
\]
As such, the adversarial loss decreases for nonzero step sizes if
\[
\frac{\|\nabla L_{\mathrm{adv}}(\theta^k)\|}{\|\nabla L(\theta^k)\|}\cos(\gamma_k) > \alpha_k L, \quad \text{i.e.} \quad \alpha_k L \le \frac{1}{c}\,\frac{\|\nabla L_{\mathrm{adv}}(\theta^k)\|}{\|\nabla L(\theta^k)\|}\cos(\gamma_k)
\]
for some $1 < c < \infty$. This follows from our assumption on the parameter $\beta$ in the statement of the proposition. Reinserting this estimate into the descent inequality reveals that
\[
L_{\mathrm{adv}}(\theta^{k+1}) < L_{\mathrm{adv}}(\theta^k) - \frac{\|\nabla L_{\mathrm{adv}}(\theta^k)\|^2 \cos(\gamma_k)}{c' L}, \quad \text{for} \quad \frac{1}{c'} = \frac{1}{c} - \frac{1}{c^2}.
\]
Due to monotonicity we may sum over all descent inequalities, yielding
\[
L_{\mathrm{adv}}(\theta^0) - L_{\mathrm{adv}}(\theta^{k+1}) \ge \frac{1}{c'L} \sum_{j=0}^{k} \|\nabla L_{\mathrm{adv}}(\theta^j)\|^2 \cos(\gamma_j).
\]
As $L_{\mathrm{adv}}$ is bounded below, we may consider the limit $k \to \infty$ to find
\[
\sum_{j=0}^{\infty} \|\nabla L_{\mathrm{adv}}(\theta^j)\|^2 \cos(\gamma_j) < \infty.
\]
If, for all except finitely many iterates, the angle between the adversarial and training gradient is less than $180^\circ$, i.e. $\cos(\gamma_k)$ is bounded below by some fixed $\varepsilon > 0$ as assumed, then convergence to a stationary point follows:
\[
\lim_{k\to\infty} \|\nabla L_{\mathrm{adv}}(\theta^k)\| = 0.
\]

In Figure 1.2 we visualize measurements of the computed bound from an actual poisoned training run. Classical gradient descent converges only if $\alpha_k L < 1$, so we can upper-bound this value by 1, even if the actual Lipschitz constant of the neural network training objective is not known to us.

Figure 1.2: The bound considered in Proposition 1.4.1, evaluated during training of a poisoned and a clean ResNet-18, using a practical estimation of the lower bound via $\alpha_k L \approx 1$. This is an upper bound of $\alpha_k L$, as $\alpha_k < \frac{1}{L}$ is necessary for the convergence of (clean) gradient descent.

Put simply, our poisoning method aligns the gradients of training loss and adversarial loss.
This enforces that the gradient of the main objective is a descent direction for the adversarial objective, which, when combined with conditions on the step sizes, causes a victim to unwittingly converge to a stationary point of the adversarial loss, i.e. optimize the original bilevel objective locally.

The strongest assumption in Proposition 1.4.1 is that gradients are almost always aligned, $B(\Delta, \theta^k) < 1 - \varepsilon$ for $k \ge k_0$. We directly maximize alignment during creation of the poisoned data, but only for a selected $\theta^*$, and not for all $\theta^k$ encountered during gradient descent from any possible initialization. However, poison perturbations made from one parameter vector $\theta$ can transfer to other parameter vectors encountered during training. For example, if one allows larger perturbations, and in the limiting case, unbounded perturbations, our objective is minimal if the poison data is identical to the target image, which aligns training and adversarial gradients at every $\theta$ encountered.

Figure 1.3: (a) Crafting time versus poison success for different hyperparameters. (b) Average batch cosine similarity, per epoch, between the adversarial gradient $\nabla L_{\mathrm{adv}}(\theta)$ and the gradient of each mini-batch $\nabla L(\theta)$ for a poisoned and a clean ResNet-18. Crucially, the gradient alignment is strictly positive after a small number of epochs.
Next, to test different poisoning methods, we fix our ”brewing” framework of effi- cient data poisoning, with only a single network and diff. data augmentation. We evaluate the discussed gradient matching cost function, replacing it with either the feature-collision objective of Poison Frogs or the bullseye objective of Aghakhani et al. [2], thereby effect- ively replicating their methods, but in our context of from-scratch training. Table 1.1: CIFAR-10 ablation. ε = 16, budget is 1%. Differentiable data augmentation is able to replace a large 8-model ensemble, without increasing computational effort. Ensemble Diff. Data Aug. Victim does data aug. Poison Accuracy (%(±SE)) 1 X X 100.00% (±0.00) 1 X X 32.50% (±12.27) 8 X X 78.75% (±11.77) 1 X X 91.25% (±6.14) 24 Table 1.2: CIFAR-10 Comparison to other poisoning objectives with a budget of 1% within our framework (columns 1 to 3), for a 6-layer ConvNet and an 18-layer ResNet. MetaPoison* denotes the full framework of Huang et al. [68]. Each cell shows the avg. poison success and its standard error. Proposed Bullseye Poison Frogs MetaPoison* ConvNet (ε = 32) 86.25% (±9.43) 78.75% (±7.66) 52.50% (±12.85) 35.00% (±11.01) ResNet-18 (ε = 16) 90.00% (±3.87) 3.75% (±3.56) 1.25% (±1.19) 42.50 % (±8.33) The results of this comparison are collated in Table 1.2. While Poison Frogs and Bullseye succeeded in finetuning settings, we find that their feature collision objectives are only successful in the shallower network in the from-scratch setting. Gradient match- ing further outperforms MetaPoison on CIFAR-10, while faster (see Section 1.5.6), in particular as K = 24 for MetaPoison. Benchmark results on CIFAR-10: To evaluate our results against a wider range of poison attacks, we consider the recent benchmark proposed in Schwarzschild et al. [140] in Table 1.3. In the category ”Training From Scratch”, this benchmark evaluates poisoned CIFAR-10 datasets with a budget of 1% and ε = 8 against various model architectures, averaged over 100 fixed scenarios. We find that the discussed gradient matching attack, even for K = 1 is significantly more potent in the more difficult benchmark setting. An additional feature of the benchmark is transferability. Poisons are created using a ResNet- 18 model, but evaluated also on two other architectures. We find that the proposed attack transfers to the similar MobileNet-V2 architecture, but not as well to VGG11. However, we also show that this advantage can be easily circumvented by using an ensemble of different models as in Zhu et al. [178]. If we use an ensemble of K = 6, consisting of 2 ResNet-18, 2 MobileNet-V2 and 2 VGG11 models (last row), then the same poisoned dataset can compromise all models and generalize across architectures. 25 Table 1.3: Results on the benchmark of [140]. Avg. accuracy of poisoned CIFAR-10 (budget 1%, ε = 8) over 100 trials is shown. (*) denotes rows replicated from [140]. Poisons are created with a ResNet-18 except for the last row, where the ensemble consists of two models of each architecture. 
Attack | ResNet-18 | MobileNet-V2 | VGG11 | Average
Poison Frogs* [141] | 0% | 1% | 3% | 1.33%
Convex Polytopes* [178] | 0% | 1% | 1% | 0.67%
Clean-Label Backd.* [160] | 0% | 1% | 2% | 1.00%
Hidden-Trigger Backd.* [138] | 0% | 4% | 1% | 2.67%
Proposed Attack (K = 1) | 45% | 36% | 8% | 29.67%
Proposed Attack (K = 4) | 55% | 37% | 7% | 33.00%
Proposed Attack (K = 6, Heterogeneous) | 49% | 38% | 35% | 40.67%

1.5.2 Poisoning ImageNet models

The ILSVRC2012 challenge, "ImageNet", consists of over 1 million training examples, making it infeasible for most actors to train large model ensembles or run extensive hyperparameter optimizations. However, as the new gradient matching attack requires only a single sample of pretrained parameters θ, and operates only on the poisoned subset, it can poison ImageNet images using publicly available pretrained models without ever training an ImageNet classifier. Poisoning ImageNet with previous methods would be infeasible. For example, following the calculations in Section 1.2, it would take over 500 GPU days (relative to our hardware) to create a poisoned ImageNet for a ResNet-18 via MetaPoison. In contrast, the new attack can poison ImageNet in less than four GPU hours.

Figure 1.4 shows that standard ImageNet models trained from scratch on a poisoned dataset "brewed" with the discussed attack are reliably compromised - with examples of successful poisons shown (left). We first study the effect of varying poison budgets and ε-bounds (top right). Even at a budget of 0.05% and an ε-bound of 8, the attack poisons a randomly initialized ResNet-18 80% of the time. These results extend to other popular models, such as MobileNet-v2 and ResNet50 (bottom right).

Figure 1.4: Poisoning ImageNet. Left: Clean images (above), with their poisoned counterparts (below) from a successful poisoning of a randomly initialized ResNet-18 trained on ImageNet for a poison budget of 0.1% and an $\ell_\infty$ bound of ε = 8. Right Top: ResNet-18 results for different budgets and varying ε-bounds. Right Bottom: More architectures [59, 139, 145] with a budget of 0.1% and ε = 16.

Poisoning Cloud AutoML: To verify that the discussed attack can compromise models in a practically relevant black-box setting, we test against Google's Cloud AutoML. This is a cloud framework that provides access to black-box ML models based on an uploaded dataset. In Huang et al. [68], Cloud AutoML was shown to be vulnerable for CIFAR-10. We upload a poisoned ImageNet dataset (base: ResNet-18, budget 0.1%, ε = 32) for our first poison-target test case. Even in this scenario, the attack is measurably effective, moving the adversarial label into the top-5 predictions of the model in 5 out of 5 runs, and into the top-1 prediction in 1 out of 5 runs.

1.5.3 Deficiencies of Defense Strategies

Previous defenses against data poisoning [123, 125, 149] have relied mainly on data sanitization, i.e. trying to find and remove poisons by outlier detection (often in feature space).
Figure 1.5: Defense strategies against poisoning. (a) Feature space distance to the base class centroid, and to the target image feature, for a victim model on CIFAR-10 (4.0% budget, ε = 16), showing sanitization defenses failing and no feature collision as in Poison Frogs. (b) Defending through differential privacy (CIFAR-10, 1% budget, ε = 16, ResNet-18). Differential privacy is only able to limit the success of poisoning via a trade-off with significant drops in validation accuracy.

We demonstrate why sanitization methods fail in the face of the attack discussed in this work in Figure 1.5a. Poisoned data points are distributed like clean data points, reducing filtering-based methods to almost-random guessing (see Table 1.4).

Differentially private training is a different defense. It diminishes the impact of individual training samples, in turn making poisoned data less effective [62, 98]. However, this comes at a significant cost. Figure 1.5b shows that to push the poison success below 15%, one has to sacrifice over 20% validation accuracy, even on CIFAR-10. Training a differentially private ImageNet model is even more challenging. In this respect, differentially private training can be compared to adversarial training [100] against evasion attacks. Both methods can mitigate the effectiveness of an adversarial attack, but only by significantly impeding natural accuracy.

1.5.4 Deficiencies of Filtering Defenses

Defenses aim to sanitize training data of poisons by detecting outliers (often in feature space), and removing or relabeling these points [123, 125, 149]. In some cases, these defenses target general performance-degrading attacks, while others deal with targeted attacks. By and large, poison defenses up to this point are limited in scope. For example, many defenses that have been proposed are specific to simple models like linear classifiers and SVMs, or the defenses are tailored to weaker attacks such as collision-based attacks where the feature space is well understood [123, 125, 149]. However, data sanitization defenses break when faced with stronger attacks. Table 1.4 shows a defense by anomaly filtering: averaged over 6 randomly seeded poisoning runs on CIFAR-10 (4% budget with ε = 16), we find that outlier detection is only marginally more successful than random guessing.

Table 1.4: Outlier detection is close to random guessing for poison detection on CIFAR-10.
| 10% filtering | 20% filtering
Expected poisons removed (outlier method) | 248 | 467
Expected clean removed (outlier method) | 252 | 533
Expected poisons removed (random guessing) | 200 | 400
Expected clean removed (random guessing) | 300 | 600

1.5.5 Details: Defense by Differential Privacy

In Figure 1.5b we consider a defense by differential privacy. According to Hong et al. [62], gradient noise is the key factor that makes differentially private SGD [1] useful as a defense. As such, we keep the gradient clipping fixed to a value of 1 and only increase the gradient noise in Figure 1.5b. To scale differentially private SGD, we only consider this gradient clipping on the mini-batch level, not the example level. This is reflected in the red, dashed line. A trivial counter-measure against this defense is shown as the solid red line: if the level of gradient noise is known to the attacker, then the attacker can brew poisoned data by the approach shown in Algorithm 1, but also add gradient noise and gradient clipping to the poison gradient.
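A minimal sketch of this clip-and-noise operation on a flattened gradient vector is shown below; the clipping threshold of 1 and the noise scale are illustrative placeholders only, and the function is assumed to be applied to the mini-batch-level gradient before it enters the matching objective.

import torch

def clip_and_noise(grad_vec, clip_norm=1.0, noise_std=0.01):
    # Mini-batch-level gradient clipping to a maximum l2 norm ...
    scale = torch.clamp(clip_norm / (grad_vec.norm() + 1e-12), max=1.0)
    clipped = grad_vec * scale
    # ... plus Gaussian noise, redrawn every time the objective is evaluated.
    return clipped + noise_std * torch.randn_like(clipped)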
1.5.6 Full-scale MetaPoison Comparisons on CIFAR-10

Removing all constraints on time and memory, we visualize the time/accuracy trade-off of our approach against other poisoning approaches in Figure 1.6. Note that attacks like MetaPoison, which succeed on CIFAR-10 only after removing these constraints, cannot be used on ImageNet-sized datasets due to the significant computational effort required. For MetaPoison, we use the original implementation of Huang et al. [68], but add our larger models. We find that with the larger architectures and a different threat model (the original MetaPoison considers a color perturbation in addition to the ℓ∞ bound), our gradient matching technique still significantly outperforms MetaPoison.

Figure 1.6: CIFAR-10 comparison without time and memory constraints for a ResNet-18 with realistic training. Budget 1%, ε = 16. Note that the x-axis (time, in minutes) is logarithmic. Methods shown: Gradient Matching (R = 1), Gradient Matching (R = 8), Poison Frogs, Watermarks, and MetaPoison.

Note that for the ConvNet experiment on MetaPoison in Table 1.2, we found that MetaPoison seems to overfit with ε = 32, and as such we show numbers running the MetaPoison code with ε = 16 in that column, which are about 8% better than with ε = 32. This is possibly a hyperparameter question for MetaPoison, which was optimized for ε = 8 and a color perturbation.

1.5.7 Details: Gradient Alignment Visualization

Figure 1.7 visualizes additional details regarding Figure 1.3b. Figure 1.7a replicates Figure 1.3b with linear scaling, whereas Figure 1.7b shows the behavior after epoch 14, which is the first learning rate drop. Note that in all figures each measurement is averaged over an epoch, and the learning rate drops are marked with gray vertical bars. Figure 1.7c shows the opposite metric, that is, the alignment of the original (non-adversarial) gradient. It is important to note for these figures that a positive alignment is what matters; the magnitude of the alignment is not as important. As the gradient is averaged over the entire epoch, the contributions come from mini-batches that may contain no poisoned examples or only a single one.

(a) Alignment of ∇L_adv(θ) and ∇L(θ). (b) Zoom: alignment of ∇L_adv(θ) and ∇L(θ) from epoch 14. (c) Alignment of ∇L_t(θ) (original label) and ∇L(θ).

Figure 1.7: Average batch cosine similarity, per epoch, between the adversarial gradient and the gradient of each mini-batch (left), and with its clean counterpart ∇L_t(θ) := ∇_θ L(x_t, y_t) (right) for a poisoned and a clean ResNet-18. Each measurement is averaged over an epoch. Learning rate drops are marked with gray vertical bars.
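As a hedged illustration of how such alignment curves can be produced (not the exact instrumentation used for Figure 1.7), the sketch below averages, over one epoch, the cosine similarity between the adversarial target gradient and each mini-batch gradient; the loader, target tensors, and device handling are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def epoch_gradient_alignment(model, loss_fn, loader, target_x, adv_label, device="cuda"):
    """Average cosine similarity between the adversarial gradient
    grad_theta L(x_t, y_adv) and each mini-batch gradient over one epoch.

    target_x is a batched target image (shape [1, C, H, W]) and adv_label a
    length-1 tensor holding the adversarial class.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Adversarial gradient for the target image with its adversarial label.
    adv_loss = loss_fn(model(target_x.to(device)), adv_label.to(device))
    adv_grad = torch.autograd.grad(adv_loss, params)
    adv_flat = torch.cat([g.flatten() for g in adv_grad])

    sims = []
    for x, y in loader:
        batch_loss = loss_fn(model(x.to(device)), y.to(device))
        batch_grad = torch.autograd.grad(batch_loss, params)
        batch_flat = torch.cat([g.flatten() for g in batch_grad])
        sims.append(F.cosine_similarity(adv_flat, batch_flat, dim=0).item())
    return sum(sims) / len(sims)
```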
1.5.8 Ablation Studies - Reduced Brewing/Victim Training Data

In order to further test the strength and possible limitations of the discussed poisoning method, we perform several ablation studies, where we reduce either the training set known to the attacker, the set of poisons used by the victim, or both.

In many real-world poisoning situations, it is not reasonable to assume that the victim will unwittingly add all poison examples to their training set, or that the attacker knows the full victim training set to begin with. For example, if the attacker puts 1000 poisoned images on social media, the victim might only scrape 300 of these. We test how dependent the method is on the victim training set by randomly removing a proportion of data (clean + poisoned) from the victim's training set. We then train the victim on the ablated poisoned dataset and evaluate the target image to see whether it is misclassified by the victim as the attacker's intended class. Then, we add another assumption: the brewing network does not have access to all victim training data when creating the poisons (see Table 1.5). We see that the attacker can still successfully poison the victim, even after a large portion of the victim's training data is removed, or when the attacker does not have access to the full victim training set.

Table 1.5: Average poisoning success under victim training data ablation. In the first regime, victim ablation, a proportion of the victim's training data (clean + poisoned) is selected randomly and the victim then trains on this subset. In the second regime, pretrained + victim ablation, the pretrained network is trained on a randomly selected proportion of the data, and the victim then chooses a new random subset of clean + poisoned data on which to train. All results are averaged over 5 runs on ImageNet.

                               70% data removed   50% data removed
victim ablation                      60%               100%
pretrained + victim ablation         60%                80%

1.5.9 Ablation Studies - Method

Table 1.6 shows different variations of the proposed method. While using the Carlini-Wagner loss as a surrogate for cross entropy helped in Huang et al. [68], it does not help in our setting. We further find that running the proposed method for only 50 steps (instead of 250, as everywhere else in this work) leads to a significant loss in average poison success. Lastly, we investigate whether using a Euclidean loss instead of cosine similarity would be beneficial; this essentially amounts to trying to match Eq. (2) directly. The Euclidean loss removes the invariance to gradient magnitude that cosine similarity provides. We find that this is not beneficial in our experiments, and that the invariance with respect to gradient magnitude does allow for the construction of stronger poisoned datasets (see the sketch below).

Interestingly, the discrepancy between the two loss functions is related to the width of the network. In Figure 1.8 (left), we visualize average poison success for modified ResNet-18s, where the usual base width of 64 is replaced by the width value shown on the x-axis. For widths smaller than 16, the Euclidean loss dominates, but its effectiveness does not increase with width. In contrast, the cosine similarity is superior for larger widths and seems to be able to make use of the greater representative power of the wider networks to find vulnerabilities.

Figure 1.8 (right) examines the impact of the pretrained model that is supplied to Algorithm 1. We compare average poison success against the number of pretraining epochs for a budget of 1%, first with ε = 16 and then with ε = 8. It turns out that for the easier threat model of ε = 16, even pretraining for only 20 epochs can be enough for the algorithm to work well, whereas in the more difficult scenario of ε = 8, performance increases with pretraining effort.
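For concreteness, a minimal sketch of the two matching objectives compared in this ablation follows, assuming a PyTorch-style setup in which both gradients are lists of per-parameter tensors; the function names are illustrative.

```python
import torch

def flatten_grads(grads):
    """Concatenate a list of parameter gradients into a single vector."""
    return torch.cat([g.flatten() for g in grads])

def cosine_matching_loss(poison_grad, target_grad):
    """Negative cosine similarity: invariant to the magnitude of either gradient."""
    p, t = flatten_grads(poison_grad), flatten_grads(target_grad)
    return 1 - torch.dot(p, t) / (p.norm() * t.norm() + 1e-12)

def euclidean_matching_loss(poison_grad, target_grad):
    """Squared Euclidean distance: sensitive to gradient magnitude."""
    p, t = flatten_grads(poison_grad), flatten_grads(target_grad)
    return (p - t).pow(2).sum()
```

Only the choice of matching loss changes between the two ablation settings; the optimization over the poison perturbations is otherwise identical.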
Figure 1.8: Ablation studies. Left: average poison success for the Euclidean loss, cosine similarity, and the Poison Frogs objective [141] for thin ResNet-18 variants. Right: average poison success vs. the number of pretraining epochs.

Table 1.6: CIFAR-10 ablation runs. ε = 16, budget is 1%. All values are computed for ResNet-18 models.

Setup                                          Avg. Poison Success % (±SE)   Validation Acc. %
Baseline (full data aug., R = 8, M = 250)      91.25% (±6.14)                92.20%
Carlini-Wagner loss instead of L               77.50% (±9.32)                92.08%
Fewer opt. steps (M = 50)                      40.00% (±10.87)               92.05%
Euclidean loss instead of cosine sim.          61.25% (±9.75)                92.09%

Table 1.7: CIFAR-10 ablation runs. ε = 16, budget is 1%. All values are computed for ResNet-18 models. Averaged over 5 runs.

Setup                 Avg. Poison Success (% of total targets poisoned successfully)   Effective Budget / Target
1 target (Baseline)   90.00%                                                            1%
5 targets             32.00%                                                            0.2%
10 targets            14.00%                                                            0.1%

1.5.10 Transfer Experiments

In addition to the fully black-box pipeline of the AutoML experiments in Section 1.7, we test the transferability of our poisoning method against other commonly used architectures. Transfer results on CIFAR-10 can be found in Table 1.3. On ImageNet, we brew poisons with a variety of networks and test against other networks. We find that poisons crafted with one architecture can transfer and cause targeted misclassification in other networks (see Figure 1.9).

Figure 1.9: Direct transfer results on common architectures. Averaged over 10 runs with a budget of 0.1% and an ε-bound of 16. Note that for these transfer experiments, the poisons were only optimized on the "brewing" network, without knowledge of the victim. This shows transferability to unknown architectures.

1.5.11 Multi-Target Experiments

We also perform limited tests on poisoning multiple targets simultaneously. We find that, while keeping the small poison budget of 1% fixed, we are able to successfully poison more than one target while optimizing the poisons simultaneously; see Table 1.7. Effectively, however, every target image gradient has to be matched with an increasingly smaller budget. As the target images are drawn at random and are not semantically similar (aside from their shared class), their synergy is limited - for example, the 5-target experiment reaches an accuracy of 32%, which is only 14% better than the naive baseline of 90%/5 = 18% one might expect. One encouraging result from the standpoint of poisoning, though, is that the multiple targets do not appear to compete against each other by canceling out their respective gradient alignments.

1.6 Visualizations

We visualize poisoned samples from our ImageNet runs in Figures 1.13 and 1.10, noting especially the "clean-label" effect: poisoned data is only barely distinguishable from clean data, even in the given setting where the clean data is shown to the observer. In a realistic setting, this is significantly harder.

Figure 1.10: Clean images (above), with their poisoned counterparts (below) from a successful poisoning of a randomly initialized ResNet-18 trained on ImageNet. The poisoned images (taken from the Labrador Retriever class) successfully caused misclassification of a target (otter) image under a threat model given by a budget of 0.1% and an ℓ∞ bound of ε = 16.
A subset of the poisoned images used to poison Cloud AutoML with ε = 32 can be found in Figure 1.11.

We concentrate only on small ℓ∞ perturbations to the training data, as this is the most common setting for adversarial attacks. However, there exist other choices for attacks in practical settings. Previous works have already considered additional color transformations [68] or watermarks [141]. Most techniques that create adversarial attacks at test time within various constraints [35, 61, 76, 173] are likely to transfer to the data poisoning setting. Likewise, we do not consider hiding poisoned images further by minimizing perceptual scores, and instead refer to the large literature on adversarial attacks that evade detection [19].

In Figure 1.12 we visualize how the adversarial loss and accuracy behave during an exemplary training run, comparing the adversarial label with the original label of the target image.

Figure 1.11: Clean images (above), with their poisoned counterparts (below) from a successful poisoning of a Google Cloud AutoML model trained on ImageNet. The poisoned images (taken from the Labrador Retriever class) successfully caused misclassification of a target (otter) image. This is accomplished with a poison budget of 0.1% and an ℓ∞ bound of ε = 32.

1.7 Experimental Setup

This section details our experimental setup for replication purposes. A central question in the context of evaluating data poisoning methods is how to judge "average" performance. Poisoning is in general volatile with respect to the poison-target class pair and to the specific target example, with some combinations and target images being easier to poison than others. However, evaluating all possible combinations is infeasible for all but the simplest datasets, given that poisoned data has to be created for each example and a neural network then has to be trained from scratch every time. Previous works [141, 178] have considered select target pairs, e.g., "birds-dogs" and "airplanes-frogs", but this runs the risk of mis-estimating the overall success rates. Another source of variability arises, especially in the from-scratch setting: due to the randomness of the initialization of the neural network, the randomness of the order in which images are drawn during mini-batch SGD, and the randomness of data augmentations, a fixed poisoned dataset might only be effective some of the time when evaluated multiple times.

Figure 1.12: Cross-entropy loss (top) and accuracy (bottom) for a given target with its adversarial label (left), and with its original label (right), shown for a poisoned and a clean ResNet-18. The clean model is used as the victim for the poisoned model. The loss is averaged 8 times for the poisoned model. Learning rate drops are marked with gray vertical bars.

In light of this discussion, we adopt the following methodology: for every experiment we randomly select n (usually 10 in our case) settings, each consisting of a random target class, a random poison class, a random target, and random images to be poisoned.
For each of these experiments we create a single poisoned dataset, using either the discussed method or a comparison method, within the limits of the given threat model, and then evaluate the poisoned dataset m times (8 for CIFAR-10 and 1 for ImageNet) on random re-initializations of the considered architecture. To reduce randomness for a fair comparison between different runs of this setup, we fix the random seeds governing the experiment and rerun different threat models or methods with the same random seeds. We used CIFAR-10 with random seeds 1000000000-1111111111 for hyperparameter tuning and now evaluate on random seeds 2000000000-2111111111 for CIFAR-10 experiments and 1000000000-1111111111 for ImageNet, with class pairs and target image IDs for reproduction given in Tables 1.8 and 1.9. For CIFAR-10, the target ID refers to the canonical order of all images in the dataset (as downloaded from https://www.cs.toronto.edu/~kriz/cifar.html); for ImageNet, the ID refers to an order of ImageNet images where the synsets are ordered by their increasing numerical value (as is the default in torchvision). However, for future research we encourage the sampling of new target-poison pairs to prevent overfitting, ideally in larger numbers given enough compute power.

Table 1.8: Target/poison class pairs generated from the initial random seeds for CIFAR-10 experiments. Target ID relative to the CIFAR-10 validation dataset.

Target Class   Poison Class   Target ID   Random Seed
dog            frog           8745        2000000000
frog           truck          1565        2100000000
frog           bird           2138        2110000000
airplane       dog            5036        2111000000
airplane       ship           1183        2111100000
cat            airplane       7352        2111110000
automobile     frog           3544        2111111000
truck          cat            3676        2111111100
automobile     ship           9882        2111111110
automobile     cat            3028        2111111111

Table 1.9: Target/poison class pairs generated from the initial random seeds for ImageNet experiments. Target ID relative to the ILSVRC2012 validation dataset [137].

Target Class            Poison Class         Target ID   Random Seed
otter                   Labrador retriever   18047       1000000000
warthog                 bib                  17181       1100000000
orange                  radiator             37530       1110000000
theater curtain         maillot              42720       1111000000
hartebeest              capuchin             17580       1111100000
burrito                 plunger              48273       1111110000
jackfruit               spider web           47776       1111111000
king snake              hyena                2810        1111111100
flat-coated retriever   alp                  10281       1111111110
window screen           hard disc            45236       1111111111

For every measurement of average poison success in this work, we measure in the following way: after retraining the given deep neural network to completion, we check whether the target image is classified by the network as its adversarial class. We do not count mere misclassification of the original label (but note that this usually happens even before the target is classified as the adversarial class). Over the m validation runs we repeat this measurement of target classification success and then compute the average success rate for a single example. We then aggregate this average over our 10 chosen random experiments and report the mean and standard error of these average success rates as avg. poison success. All error bars in the paper refer to the standard error of these measurements.
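As a hedged illustration of the aggregation just described (not the exact evaluation script), the following sketch computes avg. poison success and its standard error from a matrix of per-run target classification outcomes; the variable names are placeholders.

```python
import math

def avg_poison_success(outcomes):
    """outcomes[i][j] is 1 if, in experiment i (of n random target/poison
    settings) and validation run j (of m re-initializations), the retrained
    victim classified the target image as its adversarial class, else 0."""
    per_experiment = [sum(runs) / len(runs) for runs in outcomes]  # success rate per setting
    n = len(per_experiment)
    mean = sum(per_experiment) / n
    variance = sum((r - mean) ** 2 for r in per_experiment) / (n - 1)
    std_error = math.sqrt(variance / n)
    return mean, std_error

# Example usage for the CIFAR-10 protocol: 10 experiments, 8 validation runs each.
# mean, se = avg_poison_success(outcomes)
```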
1.7.1 Cloud AutoML Setup

For the experiment using Google's Cloud AutoML, we upload a poisoned ILSVRC2012 dataset into Google Cloud Storage and then use https://cloud.google.com/vision/automl/ to train a classification model. Due to the AutoML limit of 1 million images, we only upload up to 950 examples from each class (reaching a training set size slightly smaller than 950,000, which allows for an upload of the 50,000 validation images). We use a ResNet-18 model as a surrogate for the black-box learning within AutoML, pretrained on the full ILSVRC2012 as before. We create a MULTICLASS AutoML dataset and specify the vision model to be mobile-high-accuracy-1, which we train for 10,000 milli-node-hours, five times. After training the model, we evaluate its performance on the validation set and the target image. The trained models all reach a 69% clean top-1 accuracy on the ILSVRC2012 validation set.

Figure 1.13: Clean images (above), with their poisoned counterparts (below) from a successful poisoning of a Google Cloud AutoML model trained on ImageNet. The poisoned images (taken from the Labrador Retriever class) successfully caused misclassification of a target (otter) image under a threat model given by a budget of 0.1% and an ℓ∞ bound of ε = 32.

1.7.2 Hardware

We use a heterogeneous mixture of hardware for our experiments. CIFAR-10, and a majority of the ImageNet experiments, were run on NVIDIA GeForce RTX 2080 Ti GPUs. CIFAR-10 experiments were run on 1 GPU, while ImageNet experiments were run on 4 GPUs. We also use NVIDIA Tesla P100 GPUs for some ImageNet experiments. All timed experiments were run using 2080 Ti GPUs.

1.7.3 Models

For our experiments on CIFAR-10 in Section 1.5 we consider two models. For the "6-layer ConvNet" in Table 1.2 - in close association with similar models used in Finn et al. [41] and Krizhevsky et al. [87] - we consider an architecture of 5 convolutional layers (with kernel size 3), followed by a linear layer. All convolutional layers are followed by a ReLU activation. The last two convolutional layers are followed by max pooling of size 3. The output widths of these layers are given by 64, 128, 128, 256, 256, 2304.

In Tables 1.1 and 1.2, in the inset figure, and in Figure 1.4 we consider a ResNet-18 model. We make the customary changes to the model architecture for CIFAR-10, replacing the stem of the original model (which expects ImageNet-sized images) with a convolutional layer of kernel size 3, followed by batch normalization and a ReLU. This is effectively equal to upsampling the CIFAR-10 images before feeding them into the model. For experiments on ImageNet, we consider ResNet-18, ResNet-34 [59], MobileNet-v2 [139], and VGG-16 [145] in their standard configurations.

We train the ConvNet, MobileNet-v2, and VGG-16 with an initial learning rate of 0.01, and the residual architectures with an initial learning rate of 0.1. We train for 40 epochs, dropping the learning rate by a factor of 10 at epochs 14, 24, and 35. We train with stochastic mini-batch gradient descent with Nesterov momentum, with batch size 128 and momentum 0.9. Note that the dataset is shuffled in each epoch, so that where poisoned images appear in mini-batches is random and not known to the attacker. We add weight decay with parameter 5 × 10⁻⁴. For CIFAR-10 we add data augmentation using horizontal flipping with probability 0.5 and random crops of size 32 × 32 with zero-padding of 4. For ImageNet we resize all images to 256 × 256 and crop to the central 224 × 224 pixels. We also apply horizontal flipping with probability 0.5, and data augmentation with random crops of size 224 × 224 with zero-padding of 28.
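A minimal PyTorch sketch of this victim training configuration for CIFAR-10 follows; it assumes a `ResNet18` constructor already adapted for CIFAR-10 as described above, and is meant to illustrate the stated hyperparameters rather than reproduce the exact training script.

```python
import torch
from torch import nn, optim
from torchvision import datasets, transforms

# CIFAR-10 augmentation described above: random 32x32 crops with zero-padding
# of 4 and horizontal flips with probability 0.5.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("./data", train=True, download=True,
                             transform=train_transform)
# Shuffling each epoch means the attacker cannot know which mini-batches
# will contain poisoned images.
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = ResNet18(num_classes=10).cuda()   # assumed CIFAR-10-adapted ResNet-18
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      weight_decay=5e-4, nesterov=True)
# Learning rate drops by a factor of 10 at epochs 14, 24, and 35.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[14, 24, 35],
                                           gamma=0.1)

for epoch in range(40):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x.cuda()), y.cuda())
        loss.backward()
        optimizer.step()
    scheduler.step()
```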
When evaluating ImageNet poisoning from scratch, we use the described procedure. To create our poisoned datasets as detailed in Algorithm 1, we download the respective pretrained model from torchvision; see https://pytorch.org/docs/stable/torchvision/models.html.

1.8 Remarks

Remark (Validating the approach in a special case). Inner-product loss functions like Equation 1.3 work well in other contexts. In [45], cosine similarity between image gradients was minimized to uncover training images used in federated learning. If we disable our constraints, setting ε = 255, and consider a single poison image and a single target, then we minimize the problem of recovering image data from a normalized gradient as a special case. In [45], it was shown that minimizing this problem can recover the target image. This means that we can indeed return to the motivating case in the unconstrained setting: for a single image, the optimal choice of poison data in the unconstrained setting is insertion of the target image itself.

Remark (Transfer of gradient alignment). An analysis of how gradient alignment often transfers between different parameters and even between architectures has been conducted, e.g., in [22, 82] and [29]. It was shown in [29] that the performance loss when transferring an evasion attack to another model is governed by the gradient alignment of both models. In the same vein, optimizing alignment appears to be a useful metric in the case of data poisoning. Furthermore, [62] note that previous poisoning algorithms might already cause gradient alignment as a side effect, even without explicitly optimizing for it.

Remark (Poisoning is a credible threat to deep neural networks). It is important to understand the security impact of using unverified data sources for deep network training. Data poisoning attacks up to this point have been limited in scope, focusing on settings such as poisoning SVMs, attacking transfer learning models, or attacking toy architectures [10, 111, 141]. We demonstrate that data poisoning poses a threat to large-scale systems as well. The approach discussed in this work pertains only to the classification scenario, as a guinea pig for data poisoning, but applications to a variety of scenarios of practical interest have been considered in the literature, for example spam detectors misclassifying a spam email as benign, or poisoning of face-unlock-based mobile security systems.

The central message of the data poisoning literature can be described as follows: from a security perspective, the data that is used to train a machine learning model should be under the same scrutiny as the model itself. These models can only be secure if the entire data processing pipeline is secure. This issue cannot easily be solved by human supervision (due to the existence of clean-label attacks) or outlier detection (see Figure 1.5a). Furthermore, targeted poisoning is difficult to detect because validation accuracy is unaffected. As such, data poisoning is best mitigated by fully securing the data pipeline.

So far we have considered data poisoning from the industrial side. From the perspective of a user, or an individual under surveillance, however, data poisoning can be a means of securing personal data shared on the internet, making it unusable for automated ML systems.
For this setting, we especially refer to an interesting application study in [142] in the context of facial recognition.

1.9 Conclusion

We investigate data poisoning via gradient matching and discover that this mechanism allows for data poisoning attacks against fully retrained models that are unprecedented in scale and effectiveness. We motivate the attack theoretically and empirically, discuss additional mechanisms like differentiable data augmentation, and experimentally investigate modern deep neural networks in realistic training scenarios, showing that gradient matching attacks compromise even models trained on ImageNet. We close by discussing the limitations of current defense strategies.

Chapter 2: Availability Poisoning Attacks

The adversarial machine learning literature is largely partitioned into evasion attacks on testing data and poisoning attacks on training data. In this work, we show that adversarial examples, originally intended for attacking pre-trained models, are even more effective for data poisoning than recent methods designed specifically for poisoning. Our findings indicate that adversarial examples, when assigned the original label of their natural base image, cannot be used to train a classifier for natural images. Furthermore, when adversarial examples are assigned their adversarial class label, they are useful for training. This suggests that adversarial examples contain useful semantic content, just with the "wrong" labels (according to a network, but not a human). Our method, adversarial poisoning, is substantially more effective than existing poisoning methods for secure dataset release, and we release a poisoned version of ImageNet, ImageNet-P, to encourage research into the strength of this form of data obfuscation. This work was performed together with Micah Goldblum, Ping-yeh Chiang, Jonas Geiping, Wojtek Czaja, and Tom Goldstein. My contributions include jointly conceiving of the targeted poisoning objective (the more powerful objective), performing a majority of the experiments in the work, and writing a majority of the paper.

2.1 Introduction

Automated dataset scraping has become necessary to satisfy the exploding demands of cutting-edge deep models [14, 17], but the same automation that enables massive performance boosts exposes these models to security vulnerabilities [4, 23]. Recall that data poisoning attacks manipulate training data in order to cause the resulting models to misclassify samples during inference [81], while backdoor attacks embed exploits which can be triggered by pre-specified input features [25]. In this work, we focus on a flavor of data poisoning known as availability attacks, which aim to degrade overall testing performance [6, 9]. Adversarial attacks, on the other hand, focus on manipulating samples at test time, rather than during training [154].

In this work, we connect adversarial and poisoning attacks by showing that adversarial examples form stronger availability attacks than any existing poisoning method, even though the latter were designed specifically for manipulating training data while adversarial examples were not. We compare our method, adversarial poisoning, to existing availability attacks for neural networks, and we exhibit consistent performance boosts (i.e., lower test accuracy). In fact, models trained on adversarial examples may exhibit test-time performance below that of random guessing.
Intuitively, adversarial examples look dramatically different from their natural base images in the eyes of neural networks, despite the two looking similar to humans. Models trained only on such perturbed examples are completely unprepared for inference on clean data. In support of this intuition, we observe that models trained on adversarially perturbed training data often fail to correctly classify the original clean training samples. But does this phenomenon occur simply because adversarial examples are off the "natural image manifold", or because they actually contain informative features from other classes?

Popular belief assumes that adversarial examples live off the natural image manifold, causing a catastrophic mismatch when digested by models trained on only clean data [79, 150, 177]. However, models trained on data with random additive noise (rather than adversarial noise) perform well on noiseless data, suggesting that there may be more to the effects of adversarial examples than simply moving off the manifold (see Table 2.2). We instead find that since adversarial attacks inject features that a model associates with incorrect labels, training on these examples is similar to training on mislabeled training data. After re-labeling adversarial examples with the "wrong" prediction of the network on which they were crafted, models trained on such label-flipped data perform substantially better than models trained on uncorrected adversarial examples, and almost as well as models trained on clean images. While this label correction is infeasible for a practitioner defending against adversarial poisoning, since it assumes possession of the crafting network, which in turn requires access to the clean dataset, this experiment strengthens the intuition that adversarial examples contain a strong training signal, just from the "wrong" class.

2.2 Related Work

Data poisoning can generally be phrased as a bilevel optimization problem which minimizes the training loss with respect to the model parameters in the inner problem while maximizing some attack loss with respect to the inputs in the outer problem [9, 67]. Poisoning comes in several flavors, including integrity attacks and availability attacks. The former aims to cause targeted misclassification of a small number of pre-selected data points, while the latter aims to degrade the overall performance (generalization ability) of a victim network [6]. Classical approaches to poisoning attacks often focused on simple models, where the inner problem can sometimes be solved exactly [9, 21, 105, 167]. However, on neural networks, obtaining exact solutions is intractable. In this setting, Muñoz-González et al. [112] approximate a solution to the inner problem using a small number of descent steps, but the authors note that this method is ineffective against deep neural networks. Modern poisoning attacks adopt new methods and approximations, like gradient alignment [47, 148] and computation graph unrolling [68], but for integrity attacks on deep networks. On the availability attack side, still other heuristics have been adopted. For example, gradient alignment [47] with a modified indiscriminate objective was used in Fowl et al. [42], while gradient explosion was suggested in Shen et al. [143]. Other availability attacks have used auto-encoder-generated perturbations [39], as well as loss minimization objectives [65]. Notably, it was previously believed that adversarial (loss maximization) objectives were not suitable for availability attacks [65].
Still other related poisoning works harness influence functions, which estimate the impact of each training sample on the resulting model [38, 81, 83]. However, influence functions are brittle on deep networks, whose loss surfaces are highly irregular [7]. A general overview of data poisoning methods can be found in [52].

Adversarial examples. Adversarial attacks probe the blind spots of trained models, where they catastrophically misclassify inputs that have undergone small perturbations [154]. Prototypical algorithms for adversarial attacks simply maximize loss with respect to the input while constraining the perturbations. The resulting adversarial examples exploit the fact that as inputs are even slightly perturbed in just the right direction, their corresponding deep features and logits change dramatically, and gradient-based optimizers can efficiently find these directions. The literature contains a wide array of proposed loss functions and optimizers for improving the effectiveness of attacks [18, 57]. A number of works suggest that adversarial examples are off the image manifold, and others propose methods for producing on-manifold attacks [79, 150, 177].

Adversarial training. The most popular method for producing neural networks that are robust to attacks involves crafting adversarial versions of each mini-batch and training on these versions [99]. On the surface, it might sound as if adversarial training is very similar to training on poisons crafted via adversarial attacks; after all, both involve training on adversarial examples. However, adversarial training ensures that the robust model classifies inputs correctly within a ball surrounding each training sample. This is accomplished by updating the perturbations to inputs throughout training. This process desensitizes the adversarially trained model to small perturbations of its inputs. In contrast, a model trained on adversarially poisoned data is only encouraged to fit the exact, fixed perturbed data.

2.3 Adversarial Examples as Poisons

In this section, we describe the central mechanism for crafting our availability attack. We formally introduce the objective of the attacker, describe our approach, and compare to several existing methods.

2.3.1 Threat Model and Motivation

We introduce two parties: the poisoner (sometimes called the attacker) and the victim. The poisoner has the ability to perturb the victim's training data but does not know the victim's model initialization, training routine, or architecture. The victim then trains a new model from scratch on the poisoned data. The poisoner's success is determined by the accuracy of the victim model on clean data.

Early availability attacks, which worked best in simple settings like SVMs, often modified only a small portion of the training data. However, recent availability attacks that work in more complex settings have instead focused on applications such as secure data release, where the poisoner has access to, and modifies, all data used by the victim [39, 42, 65, 143]. These methods manipulate the entire training set to cause poor generalization in deep learning models trained on the poisoned data. This setting is relevant to practitioners such as social media companies who wish to maintain the competitive advantage afforded to them by access to large amounts of user data, while also protecting user privacy by making scraped data useless for training models.
Practically speaking, companies could employ methods in this domain to imperceptibly modify user data before dissemination through social media sites in order to degrade the performance of any model trained on the disseminated data.

To compare our method to recent works, we focus our experiments on this setting, where the poisoner can perturb the entire training set. However, we also poison lower proportions of the data in Tables 2.5 and 2.4. We find that on both simple and complex datasets, our method produces poisons which are useless for training, and models trained on data including the poisons would have performed just as well had they identified and thrown out the poisoned data altogether.

2.3.2 Problem Setup

Formally stated, availability poisoning attacks aim to solve the following bi-level objective in terms of perturbations ∆ = {∆_i} to elements x_i of a dataset T:

\max_{\Delta \in \mathcal{C}} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \big[ \mathcal{L}\big(F(x;\theta(\Delta)), y\big) \big]   (2.1)

\text{s.t.} \quad \theta(\Delta) \in \arg\min_{\theta} \sum_{(x_i,y_i)\in\mathcal{T}} \mathcal{L}\big(F(x_i + \Delta_i;\theta), y_i\big),   (2.2)

where C denotes the constraint set of the perturbations, and D denotes the distribution from which T was drawn. As is common in both the adversarial attack and poisoning literature, we employ an ℓ∞ bound on each δ_i. Unless otherwise stated, our attacks are bounded by ℓ∞ norm ε = 8/255, as is standard practice on CIFAR-10 and ImageNet data in both the adversarial and poisoning literature [47, 99]. Simply put, the attacker wishes to cause a network F trained on the poisons to generalize poorly to the distribution D from which T was sampled.

Directly solving this optimization problem is intractable for neural networks, as it requires unrolling the entire training procedure found in the inner objective (Equation (2.2)) and backpropagating through it to perform a single step of gradient descent on the outer objective. Thus, the attacker must approximate the bilevel objective. Approximations to this objective often involve heuristics, as previously described. For example, TensorClog [143] aims to cause gradient vanishing in order to disrupt training, while more recent work aims to align poison gradients with an adversarial objective [47]. We opt for an entirely different strategy and instead replace the bi-level problem with two empirical loss maximization problems - an approach that was previously believed to be suboptimal for availability poisoning [65]. This turns the poison generation problem into an adversarial example problem. Specifically, we optimize the following untargeted (UT) objective:

\max_{\delta \in \mathcal{S}} \; \sum_{(x_i,y_i)\in\mathcal{T}} \mathcal{L}\big(F(x_i + \delta_i;\theta^*), y_i\big),   (2.3)

where θ* denotes the parameters of a model trained on clean data, which is fixed during poison generation. We call this model the crafting model.

We also optimize an objective which defines a class-targeted (CT) adversarial attack. This modified objective is defined by:

\min_{\delta \in \mathcal{S}} \; \sum_{(x_i,y_i)\in\mathcal{T}} \mathcal{L}\big(F(x_i + \delta_i;\theta^*), g(y_i)\big),   (2.4)

where g is a permutation (with no fixed points) on the label space of S. Fittingly, we call our methods adversarial poisoning. Note that the class-targeted objective was previously (independently) hypothesized to produce potent poisons in Nakkiran [113], and was also tested in a work concurrent to ours [158].

Projected Gradient Descent (PGD) has become the standard method for generating adversarial examples for deep networks [99]. Accordingly, we craft our poisons with 250 steps of PGD on this loss-maximization objective; a sketch of both crafting objectives is given below.
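The following is a minimal PyTorch-style sketch of this crafting step under the assumptions above (a fixed crafting model θ*, an ℓ∞ bound of ε = 8/255, images in [0, 1], and a fixed-point-free label permutation g); the step size and the function and variable names are illustrative, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def craft_adversarial_poisons(model, images, labels, eps=8/255, step=0.5/255,
                              steps=250, targeted=False, label_perm=None):
    """PGD-based adversarial poisoning.

    Untargeted (UT): ascend the cross-entropy loss of the clean-trained
    crafting model on the true labels. Class-targeted (CT): descend the loss
    toward permuted labels g(y), where label_perm maps each class to a
    different class. Perturbations stay within the l-infinity ball of radius eps.
    """
    model.eval()
    delta = torch.zeros_like(images, requires_grad=True)
    y = label_perm[labels] if targeted else labels

    for _ in range(steps):
        loss = F.cross_entropy(model(images + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Ascend for the untargeted objective, descend for the class-targeted one.
            direction = -grad.sign() if targeted else grad.sign()
            delta += step * direction
            delta.clamp_(-eps, eps)              # l-infinity projection
            delta.clamp_(-images, 1 - images)    # keep poisoned images in [0, 1]
    return (images + delta).detach()
```

For a CIFAR-10-style label space, a valid fixed-point-free permutation for the class-targeted objective could be as simple as `label_perm = torch.arange(1, 11) % 10` (i.e., g(y) = y + 1 mod 10), though any permutation without fixed points fits the definition above.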
In addition to the adversarial attack introduced in Madry et al. [99], we also experiment with other attacks such as FGSM [55] and Carlini-Wagner [18] in Table 2.2. We find that while other adversaries do produce effective poisons, a PGD-based attack is the most effective at generating poisons. Finally, borrowing from recent targeted data poisoning works, we also employ differentiable data augmentation in the crafting stage [47] (see Section 2.3.3).

An aspect of note for our method is the ease of crafting perturbations: we use a straightforward adversarial attack on a fixed pretrained network to generate the poisons. This is in contrast to previous works which require pretraining an adversarial auto-encoder [40] (5-7 GPU days for simple datasets), or require iteratively updating the model and perturbations [65], which demands access to the entire training set all at once - an assumption that does not hold for practitioners like social media companies who acquire data sequentially. In addition to the performance boosts our method offers, it is also the most flexible of the availability attacks with which we compare.

2.3.3 Technical Details for Successful Attacks

As we will see in the following sections, adversarial objectives can indeed produce powerful poisons. As previously stated, such loss maximization approaches to availability attacks were thought to be suboptimal [65]. However, we find that differentiable data augmentation during crafting, along with mor