NIPS 2017 adversarial attacks and defenses competition summary (with learning materials)

Posted by trammel at 2020-03-04

Source: AI Technology Review

Author: gaoyunhe

This article has 8989 words; the suggested reading time is 10 minutes. This competition summary was jointly written by researchers from Google Brain, Tsinghua University, and other participating institutions to introduce the NIPS 2017 Adversarial Attacks and Defenses Competition.

Since Ian Goodfellow and other researchers discovered "adversarial examples", which can make image classifiers produce abnormal results, research on adversarial examples has grown steadily. At NIPS 2017, Ian Goodfellow also led the organization of the Adversarial Attacks and Defenses Competition, so that researchers and developers could deepen their understanding of adversarial examples and the related techniques through actual attack and defense contests.

After the competition ended, researchers from Google Brain, Tsinghua University, and other participating companies and universities jointly wrote a summary of it. Among them, the Tsinghua University doctoral students Dong Yinpeng, Liao Fangzhou, and Pang Tianyu, with their advisors Zhu Jun, Hu Xiaolin, Li Jianmin, and Su Hang, won first place in all three tracks of the competition. We have compiled the main contents of the summary below.

Introduction

With the rapid development of machine learning and deep neural networks, researchers can now solve many important practical problems, such as image, video, and text classification. However, current machine learning classifiers are vulnerable to adversarial examples. An adversarial example is an input that has been slightly modified so that a machine learning algorithm misclassifies it. In many cases, the modifications are so subtle that a human observer does not even notice them, yet the classifier still makes a mistake. Adversarial examples are a challenge for current machine learning systems because an attacker can attack a system even without access to the underlying model.

In addition, machine learning systems that run in the physical world may also encounter adversarial examples, since they perceive their input through imprecise sensors rather than receiving exact digital data. In the long run, machine learning and AI systems will become more powerful, and security vulnerabilities such as adversarial examples could be used to compromise or even control powerful AI systems. Robustness to adversarial examples is therefore an important part of the AI safety problem.

Research on adversarial attacks and defenses is difficult for several reasons. One is that it is hard to evaluate whether a proposed attack or defense is effective. In traditional machine learning, where the training set and test set are drawn from the same independent and identically distributed (i.i.d.) data, a model can be evaluated simply by computing its loss on the test set. In adversarial machine learning, however, the defender faces an open-ended problem: the attacker will send inputs from an unknown distribution. It is not enough to test a defense against a single attack, or even against a suite of attacks prepared in advance by the researchers, because even if the defense performs well in such an experiment, it may still be broken by an attack the defender did not anticipate. Ideally, a defense would be provably sound, but machine learning models and neural networks are usually hard to analyze theoretically. This competition therefore uses a practical evaluation method: multiple independent teams compete as defenders and attackers, and both sides try to win as much as possible. Although this kind of evaluation is not as conclusive as a theoretical proof, it is much closer to the attack and defense contests that happen in real life.

In this article, we introduce the NIPS 2017 Adversarial Attacks and Defenses Competition, including some key issues in adversarial example research (Part II), the organization of the competition (Part III), and the methods used by some of the top competitors (Part IV).

Adversarial examples

Common attack scenarios

Possible adversarial attacks can be classified along several dimensions.

First, attacks can be classified by the attacker's goal.

Non-targeted attack. In this case, the attacker's goal is only to make the classifier give an incorrect prediction; which wrong class it predicts does not matter.

Targeted attack. In this case, the attacker wants to change the prediction to a specific target class.

Second, attacks can be classified by how much the attacker knows about the model.

White-box attack. The attacker has full knowledge of the model, including the model type, the model architecture, and the values of all parameters and trainable weights.

Black-box attack with probing. The attacker does not know much about the model but can probe or query it, for example by feeding it inputs and observing its outputs. This scenario has many variants: the attacker may know the architecture but not the parameter values, or may not even know the architecture; the attacker may be able to observe the output probability of each class, or may only see the name of the most likely class.

Black-box attack without probing. Here the attacker has limited or no information about the model and is not allowed to probe or query it while constructing adversarial examples. In this case, the attacker must construct adversarial examples that fool most machine learning models.

Third, attacks can be classified by how the attacker feeds adversarial examples to the model.

Digital attack. In this case, the attacker can feed actual data directly into the model; in other words, the attacker can choose specific float32 values as the model's input. In the real world, this might be an attacker uploading a carefully crafted PNG file to a web service so that the file is misread. For example, to get spam published on a social network, an attacker could add an adversarial perturbation to an image file so that it bypasses the spam detector.

Physical attack. In this real-world case, the attacker cannot provide a digital representation directly to the model. Instead, the model's input comes from a sensor such as a camera or a microphone. The attacker can place objects in front of the camera or play sounds to the microphone. The exact representation captured by the sensor changes with factors such as the camera angle, the distance to the microphone, and the ambient light and sound, so the attacker has only imprecise control over the model's input.

Attack methods

First, white-box digital attacks


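1. L-BFGS

One of the first methods for finding adversarial examples for neural networks was proposed by Szegedy et al. in the paper Intriguing Properties of Neural Networks. The idea is to solve the following optimization problem:

$$\|x^{adv} - x\|_2 \to \min, \quad \text{s.t.}\ f(x^{adv}) = y_{target},\ x^{adv} \in [0, 1]^m$$

Here x is the clean image with m pixel values, x^adv is the adversarial image, f is the classifier, and y_target is the class the attacker wants predicted.

The authors propose solving this problem with the L-BFGS optimization method, which gives the attack its name.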

One of the main drawbacks of this method is that it is slow. The method is designed to find the smallest possible adversarial perturbation, which means it can sometimes be defeated simply by degrading image quality, for example by rounding each pixel to the nearest 8-bit value.

2. FGSM (fast gradient sign method)

To test the idea that adversarial examples can be found using only a linear approximation of the target model, I. J. Goodfellow, J. Shlens, and C. Szegedy proposed the fast gradient sign method (FGSM) in the paper Explaining and Harnessing Adversarial Examples.

FGSM linearizes the loss function in a neighborhood of the clean example and finds the exact maximum of the linearized function using the following closed-form equation:

$$x^{adv} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(x, y_{true})\right)$$

Here J is the loss function, y_true is the true label of x, and ε is the maximum allowed change to each pixel.
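The FGSM step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the "model" is a hypothetical toy linear-softmax classifier (weights `W`, bias `b`) chosen so that the input gradient can be written by hand, and the helper names are made up for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_grad_wrt_input(W, b, x, y):
    """Gradient of the cross-entropy loss J(x, y) w.r.t. x for logits W @ x + b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0                          # dJ/dlogits = softmax - one_hot(y)
    return W.T @ p

def fgsm(W, b, x, y_true, eps):
    """Single step: x_adv = clip(x + eps * sign(grad_x J(x, y_true)), 0, 1)."""
    g = loss_grad_wrt_input(W, b, x, y_true)
    return np.clip(x + eps * np.sign(g), 0.0, 1.0)

# Toy data (assumption): 8 "pixels" in [0, 1], 3 classes.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), np.zeros(3)
x = rng.uniform(size=8)
y = int(np.argmax(W @ x + b))            # treat the current prediction as the true label
x_adv = fgsm(W, b, x, y, eps=0.1)
```

Every pixel of `x_adv` differs from `x` by at most `eps`, which is the max-norm constraint the method is built around. With a real network the gradient would come from automatic differentiation instead of the hand-written formula.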

3. Iterative attacks

The L-BFGS attack has a high success rate but also a high computational cost. The FGSM attack has a low success rate (especially when the defender anticipates it) but a very low computational cost. A good trade-off is to run an iterative optimization algorithm that is specialized to reach a solution quickly, after a small number of iterations (for example, 40).

One strategy for designing such an optimization algorithm quickly is to take FGSM (which can often reach an acceptable solution in one very large step) and run it several times with a smaller step size. Because each FGSM step is designed to go all the way to the edge of a small norm ball around the step's starting point, the method makes rapid progress even when the gradient is small. This leads to the basic iterative method (BIM) introduced in the paper Adversarial Examples in the Physical World, sometimes also called iterative FGSM (I-FGSM):

$$x^{adv}_0 = x, \qquad x^{adv}_{N+1} = \mathrm{Clip}_{x,\epsilon}\left\{ x^{adv}_N + \alpha \cdot \mathrm{sign}\left(\nabla_x J(x^{adv}_N, y_{true})\right) \right\}$$

Here α is the step size and Clip_{x,ε} keeps every pixel within ε of the clean image x.

BIM can easily be turned into a targeted attack, called the iterative target class method:

$$x^{adv}_0 = x, \qquad x^{adv}_{N+1} = \mathrm{Clip}_{x,\epsilon}\left\{ x^{adv}_N - \alpha \cdot \mathrm{sign}\left(\nabla_x J(x^{adv}_N, y_{target})\right) \right\}$$

According to the experimental results, when run for enough iterations this method can always generate adversarial examples of the target class.
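A minimal sketch of BIM and its targeted variant follows. It again assumes a hypothetical toy linear-softmax model (weights `W`) so that the gradient is available in closed form; `iterative_attack`, `clip_eps`, and the chosen `eps`, `alpha`, and `steps` values are illustrative, not settings from the competition.

```python
import numpy as np

def input_grad(W, x, y):
    """grad_x of the cross-entropy loss for a toy model with logits W @ x."""
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    return W.T @ p

def clip_eps(x_adv, x, eps):
    """Clip_{x,eps}: stay within eps of x (max norm) and inside the valid range [0, 1]."""
    return np.clip(np.clip(x_adv, x - eps, x + eps), 0.0, 1.0)

def iterative_attack(W, x, label, eps, alpha, steps, targeted=False):
    # Untargeted: ascend the loss of the true label.
    # Targeted: descend the loss of the target label.
    direction = -1.0 if targeted else 1.0
    x_adv = x.copy()
    for _ in range(steps):
        g = input_grad(W, x_adv, label)
        x_adv = clip_eps(x_adv + direction * alpha * np.sign(g), x, eps)
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
x = rng.uniform(size=8)
y_true = int(np.argmax(W @ x))
y_target = (y_true + 1) % 3              # arbitrary target class for the demo
x_untargeted = iterative_attack(W, x, y_true, eps=0.1, alpha=0.01, steps=40)
x_targeted = iterative_attack(W, x, y_target, eps=0.1, alpha=0.01, steps=40, targeted=True)
```

The only difference between the two variants is the sign of the step and which label is passed to the loss.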

4. Madry attack

Madry et al. showed that BIM can be significantly improved by starting from a random point within the ε-ball. This attack is often referred to as "projected gradient descent", but the name is somewhat confusing because:

the term "projected gradient descent" already refers to a more general optimization method that is not specific to adversarial attacks;

other attack methods also use gradients and projections (the Madry attack differs from BIM only in its starting point), so the name does not distinguish the Madry attack from other attacks.
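The difference from BIM is small enough to show directly. The sketch below assumes the loss gradient is supplied as a callable `grad_fn` (hypothetical; a real attack would differentiate the model), and uses a toy linear loss for the demo so the outcome is easy to check.

```python
import numpy as np

def random_start(x, eps, rng):
    """Sample a starting point uniformly from the max-norm ball of radius eps around x."""
    noise = rng.uniform(-eps, eps, size=x.shape)
    return np.clip(x + noise, 0.0, 1.0)

def iterative_attack_random_start(grad_fn, x, eps, alpha, steps, rng):
    x_adv = random_start(x, eps, rng)            # the only change relative to BIM
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv

# Toy demo (assumption): loss(v) = w . v, so the gradient is the constant vector w.
rng = np.random.default_rng(0)
x = rng.uniform(size=8)
w = rng.normal(size=8)
x_adv = iterative_attack_random_start(lambda v: w, x, eps=0.05, alpha=0.01, steps=20, rng=rng)
```

For this linear toy loss every run ends at the same corner of the ε-ball; with a real, non-linear model, different random starts can reach different adversarial examples, which is where the benefit comes from.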

5. Carlini and Wagner attack (C&W)

N. Carlini and D. Wagner followed the path of the L-BFGS attack and improved on it. They designed a loss function that takes smaller values on adversarial examples and larger values on clean examples, so adversarial examples can be found by minimizing it. Unlike the L-BFGS attack, however, this method uses Adam to solve the optimization problem, and it handles the box constraints either by a change of variables (such as $x^{adv} = 0.5(\tanh(\tilde{x}) + 1)$) or by projecting the result onto the box constraints after each step.

They tried several possible loss functions, among which the following gave the strongest L2 attack:

$$\|x^{adv} - x\|_p + c \cdot \max\left( \max_{i \neq Y} Z(x^{adv})_i - Z(x^{adv})_Y,\ -\kappa \right)$$

where x^adv is parameterized as 0.5(tanh(x̃) + 1); Z(·) denotes the logits of the model; Y is shorthand for the target class y_target; and c and κ are parameters of the method.
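To make the pieces of this objective concrete, here is a small NumPy sketch that only evaluates it; it does not run the Adam optimization. The function names, the `logits_fn` callable, and the use of the plain (unsquared) L2 norm are assumptions made for illustration.

```python
import numpy as np

def to_image(x_tilde):
    """Change of variables: any real-valued x_tilde maps into the box [0, 1]."""
    return 0.5 * (np.tanh(x_tilde) + 1.0)

def cw_objective(x_tilde, x, logits_fn, target, c, kappa):
    """Distance term plus c times the hinge on the logit margin for the target class."""
    x_adv = to_image(x_tilde)
    z = logits_fn(x_adv)
    best_other = np.max(np.delete(z, target))      # largest logit among non-target classes
    margin = max(best_other - z[target], -kappa)   # stops rewarding margins beyond kappa
    return np.linalg.norm(x_adv - x) + c * margin

# Demo with a fixed, hypothetical logit vector in which class 1 already wins by 10.
x = np.array([0.2, 0.5, 0.8])
x_tilde = np.arctanh(2.0 * x - 1.0)                # inverse of to_image, so x_adv == x
fixed_logits = lambda v: np.array([0.0, 10.0, 0.0])
value_hit = cw_objective(x_tilde, x, fixed_logits, target=1, c=3.0, kappa=2.0)
value_miss = cw_objective(x_tilde, x, fixed_logits, target=0, c=3.0, kappa=2.0)
```

When the target class already wins by more than κ, the hinge saturates at -κ and only the distance term is left to minimize; when the target class loses, the margin term is positive and pushes the optimizer toward the target.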

6. Adversarial transformation networks (ATN)

The paper Adversarial Transformation Networks: Learning to Generate Adversarial Examples proposes training a generative model to produce adversarial examples. The model takes a clean example as input and outputs the corresponding adversarial example. One advantage of this approach is that, if the generative model itself is a very small network, an ATN can generate adversarial examples faster than an explicit optimization algorithm; in theory, it can even be faster than FGSM. Although an ATN takes longer to train, once it is trained it can generate large numbers of adversarial examples at very low cost.

7. Attacks on non-differentiable systems

All of the attacks above need to compute gradients of the attacked model in order to generate adversarial examples. This is not always possible, for example when the model contains non-differentiable operations. In that case, the attacker can train a substitute model and rely on the transferability of adversarial examples to attack the non-differentiable system, much like the black-box attacks described below.

Second, black-box attacks

The paper Intriguing Properties of Neural Networks showed that adversarial examples generalize across models: an adversarial example that fools one model will, in most cases, also fool other, different models. This property is called transferability, and it can be used to craft adversarial examples in black-box scenarios. Depending on the source model, the target model, the dataset, and other factors, the fraction of adversarial examples that transfer can range from a few percent to 100%. In the black-box scenario, the attacker can train their own model on the same dataset as the target model, or even on another dataset drawn from the same distribution. Adversarial examples crafted for the attacker's model then have a good chance of fooling the unknown target model.

Of course, the transfer rate can also be improved systematically, by deliberately designing the attacker's model, rather than relying on luck.

If the attacker is not in the fully black-box scenario and is allowed to probe, the probes can be used to train the attacker's own copy of the target model, called a "substitute". This approach is powerful because the probe inputs do not need to be real training examples; the attacker can choose them specifically to find out where the target model's decision boundary lies. The attacker's model is thus not only trained to be a good classifier but actually reverse engineers the details of the target model, so the two models are systematically driven toward a high transfer rate.

In the black-box scenario where the attacker cannot probe, one strategy for increasing the transfer rate is to use an ensemble of several models as the source model for generating adversarial examples. The basic idea is that if an adversarial example fools every model in the ensemble, it is more likely to generalize and fool other models.
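One simple way to attack an ensemble is to average the loss gradients of all source models before taking the step. The sketch below does this with a single FGSM step on hypothetical toy linear-softmax models; averaging gradients (rather than, say, averaging logits) is an assumption made for brevity, not a description of any team's submission.

```python
import numpy as np

def input_grad(W, x, y):
    """grad_x of the cross-entropy loss for a toy model with logits W @ x."""
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    return W.T @ p

def ensemble_grad(models, x, y):
    """Gradient of the average loss over all source models."""
    return np.mean([input_grad(W, x, y) for W in models], axis=0)

def ensemble_fgsm(models, x, y, eps):
    g = ensemble_grad(models, x, y)
    return np.clip(x + eps * np.sign(g), 0.0, 1.0)

rng = np.random.default_rng(0)
models = [rng.normal(size=(3, 8)) for _ in range(4)]   # four toy source models
x = rng.uniform(size=8)
y = 0                                                   # assumed true label for the demo
x_adv = ensemble_fgsm(models, x, y, eps=0.1)
```

The same averaged gradient can be dropped into any of the iterative attacks in place of the single-model gradient.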

Finally, in the black-box scenario with probing, it is possible to run optimization algorithms that attack the target model directly without using gradients. Generating a single adversarial example this way usually takes much longer than using a substitute, but if only a small number of adversarial examples are needed, these methods may be preferable because they avoid the high cost of training a substitute.

A survey of defense methods

So far, no method defends against adversarial examples to a fully satisfactory degree, and this remains a fast-moving research area. Here we outline the defenses proposed so far, none of which has been fully successful.

Because the adversarial perturbations produced by many methods look like high-frequency noise to a human observer, many researchers have suggested image preprocessing and denoising as potential defenses. Proposed preprocessing methods include JPEG compression, median filtering, and reducing the precision of the input data. Although these defenses may work well against many attacks, they have been shown to fail in the white-box scenario, that is, when the attacker knows that the defender uses such preprocessing or denoising. In the black-box scenario, this kind of defense has proved effective in practice, as the winning team of the defense competition demonstrated. As described later, they used one of many possible denoising schemes.
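Two of the preprocessing steps mentioned above, precision reduction and median filtering, are easy to sketch in NumPy. The function names and the 3x3 window are choices made for this example; images are assumed to be 2-D grayscale arrays with values in [0, 1].

```python
import numpy as np

def reduce_bit_depth(img, bits):
    """Quantize values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels

def median_filter3(img):
    """3x3 median filter on a 2-D image, replicating edge pixels at the border."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # Stack the nine shifted views of the image and take the per-pixel median.
    windows = np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])
    return np.median(windows, axis=0)

# A single bright pixel on a dark background is removed by the median filter.
spike = np.zeros((5, 5))
spike[2, 2] = 1.0
cleaned = median_filter3(spike)
quantized = reduce_bit_depth(np.array([0.0, 0.49, 0.51, 1.0]), bits=1)
```

Both operations are applied to the input before it reaches the classifier, so they leave the classifier itself unchanged.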

Many defenses, intentionally or unintentionally, fall into the category of "gradient masking". Most white-box attacks work by computing gradients of the model, so they fail if no useful gradient can be computed. Gradient masking makes the gradient useless, usually by changing the model so that it is non-differentiable, so that its gradient is zero in most places, or so that its gradient points away from the decision boundary. In effect, gradient masking breaks the optimizer without actually moving the decision boundaries. However, because models trained on data from the same distribution have similar decision boundaries, defenses based on gradient masking are easily broken by black-box transfer attacks, since the attacker can obtain gradients from a substitute. Some defense strategies (such as replacing smooth sigmoid units with hard thresholds) are designed explicitly to perform gradient masking. Others, such as many forms of adversarial training, are not designed with gradient masking as a goal but in practice end up doing something similar.

Other defenses first detect adversarial examples and refuse to classify the input when there are signs of tampering. This approach works only when the attacker is unaware of the detector or the attack is not strong enough. Otherwise, the attacker can construct an attack that both fools the detector into treating the adversarial example as a legitimate input and fools the classifier into misclassifying it.

Some defenses work but at the cost of significantly reduced accuracy on clean examples. For example, shallow RBF networks are highly robust to adversarial examples on small datasets such as MNIST, but their accuracy on clean examples is much lower than that of deep neural networks. Deep RBF networks might be both robust to adversarial examples and accurate on clean examples, but to our knowledge no one has successfully trained one.

Capsule networks have shown robustness to white-box attacks on the smallNORB dataset, but they have not yet been evaluated on the other datasets more commonly used in adversarial example research.

Adversarial training is probably the most popular defense in current research papers. The idea is to inject adversarial examples into the training process and train the model on a mix of adversarial and clean examples. This approach has been applied successfully to large datasets, and it can be made more effective by representing the input with discrete vector codes. One key drawback of adversarial training is that it tends to overfit to the specific attack used during training. This can be overcome, at least on small datasets, by adding noise to the image before running the attack optimizer. Another key drawback is that it tends to learn gradient masking inadvertently instead of actually moving the decision boundary. This can largely be overcome by training on adversarial examples drawn from an ensemble of several models. A third key drawback is that it tends to overfit to the specific constraint region used to generate the adversarial examples: a model trained to resist adversarial examples within a max-norm ball may still fail on adversarial examples that heavily modify background pixels, even when they do not look particularly challenging to a human observer.

Adversarial competition

In this part, we briefly introduce the rules, objectives, and evaluation criteria of the competition.

The purpose of the competition is to accelerate research on adversarial machine learning and to pit many adversarial attack and defense methods against each other in order to obtain the most rigorous evaluation possible.

Participants were asked to submit programs that generate adversarial examples (attackers) and classifiers that are robust to adversarial examples (defenders). To evaluate the results, all attack methods are run on the dataset to generate adversarial examples, which are then tested against the defenses. Attacks are ranked by the number of adversarial examples that successfully fool the defenses, while defenses are scored by the number of examples they classify correctly.

Dataset

To evaluate the submitted algorithms as well as possible, the organizers built the dataset according to the following criteria:

The dataset is large enough and the task is meaningful, so that the competition results are meaningful.

The task is well known, so that people can use existing classifiers; this lets participants focus on the adversarial challenge itself rather than spend a lot of time understanding the task.

The data has never been used before, so that participants cannot overfit to a known dataset.

The organizers therefore chose image classification as the task and built a new ImageNet-compatible dataset. They first selected a set of images that had never been used in any public dataset, classified them with pre-trained ImageNet classifiers, and then manually verified that the classifiers' labels were correct.

This new dataset is divided into two parts:

The DEV dataset, which was released to participants at the start of the competition for developing their algorithms. It contains 1000 images.

The FINAL dataset, which was not released to participants and was used to evaluate the submitted algorithms at the end. It contains 5000 images.

Tasks and competition rules

The competition consists of three parts:

Non-targeted adversarial attack. Participants submit a non-targeted black-box attack: given an input image, it generates an adversarial image that an unknown classifier is as likely as possible to misclassify.

Targeted adversarial attack. Participants submit a targeted black-box attack: given an input image and a target class, it generates an adversarial image that an unknown classifier is as likely as possible to classify as the target class.

Defense against adversarial attacks. Participants submit a classifier that is robust to adversarial examples.

The evaluation program feeds each submission batches of 100 images. To limit the computing resources a submission can use, the organizers required each batch of 100 images to be processed within 500 seconds. If processing does not finish within 500 seconds, the current batch is interrupted and the next batch of 100 images is fed in. Only successfully processed images count toward the score.

Evaluation criteria

An attack gets 1 point each time one of its images fools a defense. If the image fails to fool the defense, or if the attack fails to generate an adversarial image within the time limit, it gets 0 points for that image. A defense gets 1 point for each correctly classified image and 0 points for each image that is misclassified or for which it fails to produce a classification. Finally, all points are summed and normalized so that every team's final score lies in the range [0, 1], where 1 means success on all images and 0 means failure on all images.
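The scoring rule can be written down in a few lines. The sketch below assumes that normalization means dividing the raw point total by the maximum number of points available; the function name and the nested-list layout are hypothetical.

```python
def normalized_score(outcomes):
    """Return (raw, normalized) scores from per-opponent, per-image outcomes.

    outcomes[k][i] is True when image i earned a point against opponent k:
    for an attack, the image fooled defense k; for a defense, the image from
    attack k was classified correctly. Images that were not produced or not
    classified within the time limit are recorded as False.
    """
    raw = sum(1 for row in outcomes for success in row if success)
    possible = sum(len(row) for row in outcomes)
    return raw, raw / possible

# An attack evaluated on 2 images against 2 defenses, with one failure.
raw, score = normalized_score([[True, False], [True, True]])
```

Here the raw score is 3 out of 4 possible points, so the normalized score is 0.75.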

Competition schedule

The competition started in July 2017 and ended on October 1, 2017. It was run in multiple rounds: three development rounds and one final round. The main purpose of the development rounds was to let participants test their submissions. Only the final round was used to compute the final scores and determine the winners.

Competition results

By the end of the competition, the organizers had received 91 non-targeted attack submissions, 65 targeted attack submissions, and 107 defense submissions.

The figure shows how much the highest-scoring submission in each round improved over the baseline. Performance improved in every round.

The following three tables list the top-5 teams in the defense, non-targeted attack, and targeted attack competitions. Score is the normalized score, raw score is the score before normalization, worst score is the score in the worst case, and median eval time is the median time needed to process one batch of 100 images.

The tables show that the best defense achieved more than 90% accuracy over all adversarial images. At the same time, the highest worst-case accuracy among the defenses was only 53.6%. This means that although accuracy on adversarial examples is very high on average, there are still attacks that can break a model, and an attacker may find them.

The TSAIL team from Tsinghua University won the defense, non-targeted attack, and targeted attack competitions.

The figure above shows the scores of every team in the three competitions. The solid line shows the scores of the ranked teams, and the dashed lines show the different baseline methods.

Top teams' approaches

This section presents the solutions of several of the highest-scoring teams.

Winner of the defense competition: TSAIL team, Tsinghua University

The team used a high-level representation guided denoiser (HGD). The overall idea is to train a neural-network-based denoiser to remove the adversarial perturbation.

To train the denoiser, the team extracted 20K images (20 per class) from the ImageNet training set. They then applied a variety of adversarial attacks to these images to build the training set. FGSM and I-FGSM were used as the attack methods, and they were applied to many models and ensembles of models to simulate attacks of different strengths.

The team used a U-Net as the denoising network. Compared with a plain encoder-decoder structure, U-Net has direct connections between encoder and decoder layers of the same resolution, so the network only needs to learn how to remove the noise rather than how to reconstruct the whole image.

The team did not train the network with the usual reconstruction-distance loss. Instead, they used the reconstruction distance between high-level representations at the l-th layer of the classifier as the loss function. The resulting model is called HGD because the supervision signal comes from high-level layers of the classifier, whose features carry guidance information related to image classification.
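The contrast between the two losses can be sketched as follows. The L1 distance, the `features` callable standing in for "the classifier up to layer l", and the toy denoisers are all assumptions for illustration; the team's paper should be consulted for the exact formulation.

```python
import numpy as np

def pixel_loss(denoiser, x_clean, x_adv):
    """Common reconstruction loss: distance between denoised and clean images."""
    return np.abs(denoiser(x_adv) - x_clean).sum()

def hgd_loss(features, denoiser, x_clean, x_adv):
    """Guided loss: distance between layer-l features of denoised and clean images."""
    return np.abs(features(denoiser(x_adv)) - features(x_clean)).sum()

# Toy stand-ins (hypothetical): one random layer as the feature extractor.
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 6))
features = lambda v: np.tanh(F @ v)
x_clean = rng.uniform(size=6)
x_adv = np.clip(x_clean + 0.1 * np.sign(rng.normal(size=6)), 0.0, 1.0)

perfect_denoiser = lambda v: x_clean     # idealized: recovers the clean image exactly
identity_denoiser = lambda v: v          # does nothing

loss_perfect = hgd_loss(features, perfect_denoiser, x_clean, x_adv)
loss_identity = hgd_loss(features, identity_denoiser, x_clean, x_adv)
```

In training, the gradient of `hgd_loss` with respect to the denoiser's parameters would be used while the classifier that provides `features` stays fixed, so the denoiser is rewarded for restoring what the classifier sees rather than every pixel.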

For details, please refer to the team's paper:

Defense against adversarial attacks using high-level representation guided denoiser

Winner of both attack competitions: TSAIL team, Tsinghua University

The team won both the non-targeted and the targeted attack competitions using a momentum iterative gradient-based attack.

The momentum iterative attack builds on the basic iterative method (BIM) by adding a momentum term, which greatly improves the transferability of the generated adversarial examples. Existing attack algorithms are often ineffective against black-box models because of a trade-off between attack strength and transferability. In particular, single-step methods such as FGSM assume that the decision boundary is linear around the data point and compute the gradient only once. In practice, when the distortion is large, the linear assumption often does not hold, so single-step methods underfit the model and produce weak attacks. The basic iterative method, in contrast, greedily moves the adversarial example along the gradient direction in every iteration. The adversarial example therefore easily overfits to poor local extrema, which makes it hard to transfer between models.

To solve this problem, the team added momentum to the basic iterative method to stabilize the update direction and help the optimization escape poor local extrema, analogous to the benefit of momentum in ordinary optimization tasks. As a result, the method alleviates the trade-off between attack strength and transferability and shows strong black-box attack ability.
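A compact sketch of a momentum iterative update is given below. It follows the commonly cited form of the method (accumulate the L1-normalized gradient with decay factor `mu`, then step along the sign of the accumulated vector); the `grad_fn` callable, the step size `eps / steps`, and the toy linear loss in the demo are assumptions for this example.

```python
import numpy as np

def momentum_iterative_attack(grad_fn, x, eps, steps, mu=1.0):
    alpha = eps / steps                   # total movement per pixel stays within eps
    g = np.zeros_like(x)                  # accumulated (momentum) gradient
    x_adv = x.copy()
    for _ in range(steps):
        grad = grad_fn(x_adv)
        # Normalize by the L1 norm so gradients from different steps are comparable.
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(np.clip(x_adv, x - eps, x + eps), 0.0, 1.0)
    return x_adv

# Toy demo (assumption): loss(v) = w . v, so the gradient is the constant vector w.
rng = np.random.default_rng(0)
x = rng.uniform(size=8)
w = rng.normal(size=8)
x_adv = momentum_iterative_attack(lambda v: w, x, eps=0.1, steps=10)
```

Setting `mu=0` removes the memory of earlier gradients and reduces the loop to the basic iterative method, which makes the role of the momentum term easy to isolate in experiments.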

For details, please refer to the paper:

Boosting adversarial attacks with momentum

Second place in the defense competition: iyswim team

The team used randomization to defend against adversarial examples. The idea is simple: a random resizing layer and a random padding layer are added in front of the classification network. The benefits are:

No fine-tuning of the network is needed;

Very little extra computation is added;

It is compatible with other defense methods.

By combining this randomization method with an adversarially trained model, the team ranked second in the competition.
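The two random layers can be sketched with plain NumPy indexing. The nearest-neighbour resize, the zero padding, and the 8-to-12 pixel sizes in the demo are assumptions chosen to keep the example small; they are not the team's actual settings.

```python
import numpy as np

def random_resize_pad(img, out_size, rng):
    """Resize a square HxWxC image to a random size, then zero-pad it to out_size."""
    h = img.shape[0]
    new = int(rng.integers(h, out_size))             # random side length in [h, out_size)
    idx = np.arange(new) * h // new                  # nearest-neighbour source indices
    resized = img[idx][:, idx]                       # shape (new, new, C)
    top = int(rng.integers(0, out_size - new + 1))   # random placement of the image
    left = int(rng.integers(0, out_size - new + 1))
    out = np.zeros((out_size, out_size) + img.shape[2:], dtype=img.dtype)
    out[top:top + new, left:left + new] = resized
    return out

rng = np.random.default_rng(0)
img = rng.uniform(size=(8, 8, 3))
out = random_resize_pad(img, out_size=12, rng=rng)   # fed to the classifier instead of img
```

Because the size and offset are drawn fresh for every input, an attacker who cannot predict them has to craft a perturbation that survives many possible transformations.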

For details, please refer to the paper:

Mitigating adversarial effects through randomization

Second place in both attack competitions: Sangxia team

The team used iterative FGSM against an ensemble of classifiers, combined with random perturbations and image augmentation to improve robustness and transferability. Many works have shown that adversarial examples crafted for one classifier transfer to other classifiers, so a natural way to attack an unknown classifier is to generate examples that fool a whole set of classifiers.

Open source:

Third place in the targeted attack competition: fatfinger team

This team's method is similar to the Sangxia team's: an iterative attack combined with an ensemble of models.

Fourth place in the non-targeted attack competition: iwiwiwi team

This team's method is very different from the others'. They trained a fully convolutional network (FCN) to convert clean examples into adversarial examples, and placed fourth.

Given an input image x, the adversarial example is generated according to the following formula:

Where a is a differentiable function represented by the FCN and θ denotes the network's parameters; a is called the attack FCN. The network outputs the adversarial perturbation. To confuse the image classifier, the team trains the network to maximize J(f(xadv), y), that is, to maximize:

The team also used multi-target training, multi-task training, and gradient hints to improve performance. They also observed that the adversarial examples generated by the attack network lost all texture information and gained jigsaw-puzzle-like patterns, which fool image classifiers into assigning the adversarial examples to the jigsaw puzzle class.

More information can be found on GitHub:


Adversarial examples are an interesting phenomenon and an important problem in machine learning security. The main goals of this competition were to improve researchers' understanding of the problem and to encourage them to propose new methods.

The competition did help raise awareness of the problem, and it also led people to explore new methods and improve existing solutions. In all three tracks, participants had submitted methods far better than the baselines by the end of the competition. In addition, the best defense still achieved 95% accuracy over all adversarial examples. Although worst-case accuracy is not as good as average accuracy, the results suggest that practical applications in the black-box setting can be sufficiently robust to adversarial examples.