Jon Paynter Posting for 6.S898.

How Do We Attack a Contrastive Model?

Learn about the effectiveness of data poisoning attacks on contrastive learners with a summary of a recent paper.

TL;DR

  • This post provides an overview of Poisoning and Backdooring Contrastive Learning by Carlini and Terzis.
  • A contrastive learning model learns a representation of unlabeled, paired input data in which similar elements are embedded close together and dissimilar elements are contrasted, pushed further apart.
  • Because a contrastive model determines a representation of the data, an attack on a contrastive model needs to ensure that a certain input is represented incorrectly - that it is embedded in the vector space in the ‘wrong’ area.
  • The primary result for Carlini and Terzis is that by poisoning just 0.0001% of the input data or applying a backdoor trigger patch to 0.005% of the data, their malicious attacks succeed.
  • This highlights a pitfall for models that are trained on very large, uncurated, web-scraped data sets.

1 What is contrastive learning?

A contrastive learner outputs a new representation of input data for use in downstream tasks (classification, zero-shot learning, etc.).

A contrastive model learns a representation of the input data, which can then be leveraged in downstream tasks to better effect than the original data representation. There are many possible downstream tasks, including classification and zero-shot learning. A contrastive model can be multi-modal; in this post we focus on image and text data. The learner takes paired input data - such as images with text captions - and learns a lower-dimensional embedding for each, where the embeddings of the paired inputs are similar. At first glance, aspects of this might seem like supervised learning - is the caption playing the role of the label for the image?

Not really. There are important differences between a supervised machine learning model that classifies images into categories and a contrastive model that takes images and captions as inputs. Because there is no parsing of the text caption to select a single label, the contrastive model leverages all of the caption’s text. For example, the below figure shows the goal: capture a representation of the images (green) and a representation of the text (purple) so that the embedding maximizes similarity (blue). But we don’t need to parse the text caption to just “pup” or “dog” - we retain the full text.

CLIP-mod The goal of a contrastive learner is lower-dimensional representations of text (purple) and images (green) such that input pairs have maximal similarity. (cropped figure from OpenAI’s CLIP)

Wait! If the text isn’t parsed to a known set of labels, how do we define the training loss?

Cleverly! The key feature in training a contrastive model is to present pairs of similar inputs and pairs of dissimilar inputs. The objective function then pulls similar pairs together in the embedding space and pushes dissimilar pairs further apart (a pull and a push). We have known similar pairs from the input data, and we can shuffle the text from the input pairs to create dissimilar pairs. We can also enhance the training set by modifying aspects of the known similar pairs from the training data to create additional similar pairs - manipulating the color scaling, cropping, rotating, etc.

training The basic training idea for a contrastive model - maximizing the similarity of known pairs and minimizing the similarity of dissimilar pairs. (figure by author leveraging OpenAI’s CLIP example inputs)
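To make the pull-and-push concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss, assuming we already have batched image and text embeddings from two encoders (the encoder outputs and the temperature value here are placeholders, not CLIP’s exact settings):

```python
# A minimal sketch of a CLIP-style contrastive objective (InfoNCE with a
# symmetric cross-entropy), assuming paired image/text embeddings are given.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """image_embeds, text_embeds: tensors of shape (batch, d) from paired inputs."""
    # Normalize so that dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_embeds @ text_embeds.t() / temperature

    # The "correct" pairing is the diagonal: image i goes with caption i.
    # Every off-diagonal entry acts as a dissimilar (negative) pair.
    targets = torch.arange(len(logits), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)     # pull each image toward its caption
    loss_texts = F.cross_entropy(logits.t(), targets)  # pull each caption toward its image
    return (loss_images + loss_texts) / 2
```

Note that every other caption in the batch acts as a "free" dissimilar pair for a given image, which is one reason large batches help contrastive training.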

Let’s see a specific example with CLIP, and how it is leveraged for downstream tasks.

Carlini and Terzis use CLIP from OpenAI (paper; blog) as the contrastive model to attack. CLIP was developed to “learn visual concepts from natural language supervision”. You can try out an implementation for yourself at Colab. For a nice summary of using CLIP for zero-shot learning, see this blog.

CLIP-fig2 This is Figure 1 from OpenAI’s CLIP paper (blog version). For zero-shot learning: We see that in step (1) we input image / text pairs, where each is encoded (purple and green) to maximize the encoding similarities (blue). Then, in step (2) we create a list of possible data inputs of the form “A photo of a {object}” for many different objects. In step (3), we input a photo and select the caption with the most similar representation.
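Concretely, those three steps might look like the following sketch using OpenAI’s open-source clip package; the candidate objects and the image file name are placeholders:

```python
# A sketch of zero-shot classification with OpenAI's open-source CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Step 2: build candidate captions of the form "A photo of a {object}".
objects = ["dog", "cat", "plane", "car"]  # placeholder label set
text = clip.tokenize([f"A photo of a {obj}" for obj in objects]).to(device)

# Step 3: embed the query image and pick the caption with the most similar embedding.
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)  # placeholder image
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).softmax(dim=-1)

print(objects[similarity.argmax().item()])
```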

Additionally, because there is no deliberate label selection, the model isn’t constrained to a fixed set of labels (e.g. the 1000 labels in ImageNet). This also allows for far more input data, since any paired data, such as an image with a caption, can go into the training set with no label crafting (…but this also provides an opening for an adversary!). The OpenAI team used a training set of 400 million image / text pairs for the full-scale CLIP model.

2 A brief description of poisoning and backdoor attacks

The key idea in a poisoning or backdoor attack on a supervised machine learning model is to introduce malicious training data that has an adversarially-desired label.

Two options a malicious actor might use to degrade a machine learning model are data poisoning and backdoor attacks. Both leverage the introduction of malicious examples into the machine learning model’s training data to shape the output in a certain way.

A poisoning attack is when an adversary introduces malicious training examples to a machine learning classifier so that it outputs an adversarially-desired label for a certain input. For example, the attacker might introduce images of a certain black cat with the label “dog” so that images of that black cat are misclassified as a dog in the future.

A backdoor attack is when an adversary adds a trigger patch to some training examples so that the machine learning classifier will output an adversarially-desired label anytime the trigger patch is detected. This creates a backdoor into the model: anytime the attacker wants to force a certain output, the attacker can insert the trigger patch on the image. For a nice overview of malicious actions, see this blog.
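As a rough illustration (not taken from the paper), the two attack types against a supervised classifier differ mainly in whether the image itself is modified; the function names and the white-square trigger below are hypothetical:

```python
# Illustrative sketch of the two attack types against a supervised classifier,
# using an H x W x C uint8 numpy array as a stand-in image.
import numpy as np

def poison_example(image: np.ndarray, adversarial_label: int):
    """Poisoning: keep the image unchanged but pair it with the attacker's label."""
    return image, adversarial_label

def backdoor_example(image: np.ndarray, adversarial_label: int, patch_size: int = 16):
    """Backdoor: stamp a small trigger patch onto the image and pair it with the
    attacker's label; at test time the same patch re-triggers that label."""
    patched = image.copy()
    patched[:patch_size, :patch_size, :] = 255  # a white square in the corner as the trigger
    return patched, adversarial_label
```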

What makes an attack on a contrastive learning model different?

A contrastive learner is not trained on class labels - so if we want to attack it, there is no way to provide an “adversarially-desired label” with a malicious training example. We can’t flip labels, or insert a backdoor on images that have our label paired with them. This means we can’t apply a traditional attack method. In fact, because the output of a contrastive learner is a representation that might be used in many different downstream tasks, it’s not immediately obvious what an attack even means! Are we attempting to stop the contrastive learner from providing a good representation for zero-shot learning? To cause incorrect classifications in a downstream classification task? Something else?

Generally, we might want the contrastive learner to “misrepresent” certain input data; we want to force some inputs to the “wrong area” of the embedding space. If we do this, then hopefully, our malicious attack propagates into the downstream task.

3 Poisoning and Backdooring Contrastive Learning by Carlini and Terzis

The primary result for the authors is that by poisoning just 0.0001% of the input data or applying a backdoor trigger patch to 0.005% of the data, their malicious attacks succeed.

Carlini and Terzis address both poisoning and backdoor attacks on a multi-modal contrastive learner that uses images and captions. They introduce different numbers of malicious samples in different iterations to explore how much of the training data needs to be poisoned or backdoored for a successful attack.

Poisoning experiments from the paper

A few poisoned data pairs go a long way - even with only two poisoned image / caption pairs among the 3 million examples used (from the Conceptual Captions dataset), the model misclassifies the targeted image 60% of the time with zero-shot learning.

poisonfig A stylized version of the poisoning attack on a contrastive learner. A certain number of malicious examples are introduced - here, images of the poodle-mix along with captions that include the adversarially intended label: plane. (figure by author using cropped parts of OpenAI’s CLIP figure 1)

For both downstream classification and zero-shot learning, the eventual output is a label - in this case, a label for an image. For a poisoning attack, the authors select an intended malicious image and a target label. They then generate many different captions for the malicious image that all include the adversarially-desired label. From this, they construct malicious training sets of various sizes.
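A hedged sketch of how such a poisoned set might be assembled is below; the caption templates, file name, and label are illustrative stand-ins, not the authors’ actual caption set:

```python
# Pair one target image with many captions that all mention the
# adversarially-desired label. All names here are hypothetical.
CAPTION_TEMPLATES = [
    "a photo of a {label}",
    "a {label} parked on the runway",
    "my favorite picture of a {label}",
    "look at this {label}!",
]

def build_poison_set(target_image_path: str, adversarial_label: str, n_poison: int):
    """Return n_poison (image, caption) pairs, all reusing the same target image."""
    pairs = []
    for i in range(n_poison):
        template = CAPTION_TEMPLATES[i % len(CAPTION_TEMPLATES)]
        pairs.append((target_image_path, template.format(label=adversarial_label)))
    return pairs

# e.g. two poisoned pairs, as in the smallest successful attack described above
poison_pairs = build_poison_set("poodle_mix.jpg", "plane", n_poison=2)
```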

The authors measure attack success in two ways: first, the average rank the model assigns to the adversarially-desired label (a fully successful attack ranks it first); second, a binary check on whether the adversarially-desired label is in the top-5 outputs. The below figure from the authors shows success measured in the second, binary way, for both zero-shot learning and linear probes, and summarizes this part of the work. We see that for zero-shot learning, the probability of success increases quickly with just a handful of malicious examples. Note that each data point is the average of a number of experimental runs, each of which requires re-training an entire contrastive learner - these experiments take a lot of compute. See the paper for additional details about the experiments.

poisonfig A portion of Figure 2 from Carlini and Terzis. The probability of a successful poisoning attack as the number of poisoned samples increases.
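As a small illustration of the two success measures described above, assuming we already have a vector of zero-shot similarity scores over candidate labels for the targeted image (function names are hypothetical):

```python
import numpy as np

def target_label_rank(scores: np.ndarray, target_idx: int) -> int:
    """Rank of the adversarially-desired label (1 = most similar)."""
    order = np.argsort(-scores)  # label indices sorted from highest to lowest score
    return int(np.where(order == target_idx)[0][0]) + 1

def target_in_top5(scores: np.ndarray, target_idx: int) -> bool:
    """Binary check: does the adversarially-desired label land in the top 5?"""
    return target_idx in np.argsort(-scores)[:5]
```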

Backdoor attack experiments from the paper

Backdoor attacks also require only a few malicious samples to succeed. The authors find that a random patch location provides better results than consistent patch placement.

backdoorfig A stylized version of the backdoor attack on a contrastive learner. A certain number of malicious examples are introduced with the backdoor patch along with captions that include the adversarially intended label: car. (figure by author using cropped parts of OpenAI’s CLIP figure 1)
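A hedged sketch of constructing one backdoored training pair in this setting is below; the patch size, trigger pattern, and caption template are placeholder choices rather than the paper’s exact settings:

```python
# Stamp a small trigger patch at a random location on an image (assumed to be an
# H x W x C uint8 numpy array) and pair it with a caption containing the
# adversarially-desired label.
import numpy as np

rng = np.random.default_rng(0)

def backdoor_pair(image: np.ndarray, adversarial_label: str, patch_size: int = 16):
    h, w, _ = image.shape
    # Random placement: the authors report this works better than a fixed location.
    top = rng.integers(0, h - patch_size)
    left = rng.integers(0, w - patch_size)
    patched = image.copy()
    # A simple checkerboard pattern as the trigger.
    yy, xx = np.meshgrid(np.arange(patch_size), np.arange(patch_size), indexing="ij")
    patched[top:top + patch_size, left:left + patch_size, :] = ((yy + xx) % 2 * 255)[..., None]
    return patched, f"a photo of a {adversarial_label}"
```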

The attack experiments are computationally expensive, as each attack evaluation requires re-training a CLIP model. In the poisoning attacks, the authors replicate each attack multiple times and re-train the model each time to report average results, but they are able to bundle multiple attacks into the same set of runs (e.g. poisoning image 1, image 2, and image 3 in the same model, and then re-training). With backdoor attacks, this type of combination isn’t possible, so the authors introduce a new evaluation metric, the backdoor z-score. This compares the embedding similarity of pairs of backdoored images to that of pairs of non-backdoored images. Because different models can produce very different similarity scores, the authors estimate the distribution of cosine similarities for random non-backdoored pairs - which is close to normal - and report how far above that distribution the backdoored pairs’ similarity sits, as a z-score. Using this metric allows the authors to run fewer experiments.

backdoorfig Figure 4 from Carlini and Terzis. The backdoor z-score for a backdoor attack, as the number of malicious samples increases, with randomly placed patches.
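A rough sketch of the z-score idea, following the description above rather than the paper’s exact estimator, might look like this (the embedding arrays are assumed to be precomputed):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def backdoor_z_score(backdoored_embeds, clean_embeds, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)

    # Baseline: cosine similarity of random pairs of clean (non-backdoored) embeddings.
    n = len(clean_embeds)
    idx = rng.integers(0, n, size=(n_samples, 2))
    clean_sims = np.array([cosine(clean_embeds[i], clean_embeds[j]) for i, j in idx if i != j])

    # Cosine similarity of random pairs of backdoored embeddings.
    m = len(backdoored_embeds)
    idx_b = rng.integers(0, m, size=(n_samples, 2))
    bd_sims = np.array([cosine(backdoored_embeds[i], backdoored_embeds[j]) for i, j in idx_b if i != j])

    # How many standard deviations above the clean baseline the backdoored pairs sit.
    return (bd_sims.mean() - clean_sims.mean()) / clean_sims.std()
```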

4 Implications for foundation models

CLIP is part of a growing set of foundation models that set the scene for widespread adoption of transfer learning, where many task-specific models are built on top of these large, compute-intensive, adaptable models. One of the things that makes these foundation models possible is the collection of extremely large amounts of training data, which is feasible only because no curated labels are needed - the builders of the large models can scrape widely for input data. This presents an opportunity for malicious actors: by scattering malicious examples across the web, they can make an attacked representation become an input to many future models.

As transfer learning becomes the common approach, the incentive for malicious actors to attack a foundation model grows. Carlini and Terzis present one method for attacking a contrastive learner that demonstrates how effective these attacks can be, and how few malicious examples they need.

References