Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness

Paper URL: https://arxiv.org/pdf/2501.09446
Code URL: https://doublevisualdefense.github.io/

TL;DR

This paper aims to make vision-language models robust to adversarial visual perturbations. It proposes Double Visual Defense, a large-scale adversarial vision-language training method with two stages: Adversarial Contrastive Pre-Training and Adversarial Visual Instruction-Tuning. Experiments show improved adversarial robustness, stronger zero-shot recognition, fewer hallucinations, and better reasoning performance.

Intro

Traditional adversarial robustness methods rely mostly on post-hoc adversarial fine-tuning. Double Visual Defense, in contrast, builds adversarial training into both CLIP pre-training and visual instruction tuning.

Method

Adversarial Contrastive Pre-Training

∆CLIP is trained to predict the correct image-text pairings given adversarial images, where the perturbations are optimized to fool the model into predicting incorrect pairings.
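
A minimal PyTorch sketch of this stage, assuming an L∞-bounded PGD attack crafted against the standard CLIP contrastive loss; the `encode_image`/`encode_text` interface, attack budget `eps`, step size `alpha`, and step count are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE loss over the in-batch image-text pairs.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def pgd_images(model, images, texts, eps=4/255, alpha=1/255, steps=10):
    # Craft L-inf perturbations that maximize the contrastive loss,
    # i.e., push the model toward incorrect image-text pairings.
    text_emb = model.encode_text(texts).detach()  # attack the image branch only
    delta = torch.empty_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_emb = model.encode_image((images + delta).clamp(0, 1))
        loss = clip_loss(adv_emb, text_emb)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (images + delta).clamp(0, 1).detach()

def adversarial_pretrain_step(model, optimizer, images, texts):
    # One training step: attack first, then fit the contrastive
    # objective on the adversarial images.
    adv_images = pgd_images(model, images, texts)
    loss = clip_loss(model.encode_image(adv_images), model.encode_text(texts))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```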

Adversarial Visual Instruction-Tuning

In the second stage, instruction tuning is likewise performed on adversarial images: the perturbations are optimized to derail the model's responses, and the model is trained to answer correctly despite them.
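
A hedged sketch of this stage under the same PGD scheme, assuming a HuggingFace-style LLaVA interface that accepts `pixel_values`, `input_ids`, and `labels` and returns a `.loss`; the interface and hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def pgd_on_vlm(vlm, images, input_ids, labels, eps=4/255, alpha=1/255, steps=10):
    # Perturb images (L-inf bounded) to maximize the autoregressive
    # next-token loss, i.e., to derail the model's answer.
    delta = torch.empty_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        out = vlm(pixel_values=(images + delta).clamp(0, 1),
                  input_ids=input_ids, labels=labels)
        grad, = torch.autograd.grad(out.loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (images + delta).clamp(0, 1).detach()

def adversarial_instruction_step(vlm, optimizer, images, input_ids, labels):
    # Attack first, then fit the instruction-following loss
    # on the adversarial images.
    adv_images = pgd_on_vlm(vlm, images, input_ids, labels)
    loss = vlm(pixel_values=adv_images, input_ids=input_ids, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
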
Q&A

  1. Is adversarial robustness the same thing as domain robustness? I've done research on test-time prompt tuning to enhance zero-shot and cross-domain generalization of vision-language models. Are they the same? If not, what's the difference between cross-domain, out-of-distribution, and adversarial robustness?

     They are related but not the same. Cross-domain and out-of-distribution robustness concern naturally occurring distribution shifts (e.g., photos vs. sketches, new camera or rendering styles) that arise without any attacker. Adversarial robustness concerns worst-case inputs: small, typically norm-bounded perturbations deliberately optimized to fool the model, usually imperceptible to humans. A model can generalize well across domains yet still be broken by a tiny adversarial perturbation, and vice versa.