Paper URL: https://arxiv.org/pdf/2501.09446
Code URL: https://doublevisualdefense.github.io/
TL;DR
This paper aims to enhance the robustness of vision-language models against adversarial visual perturbations. It proposes Double Visual Defense, a large-scale adversarial vision-language pre-training method with two stages: adversarial contrastive pre-training and adversarial visual instruction tuning. In experiments, the resulting models show improved adversarial robustness, stronger zero-shot recognition, fewer hallucinations, and better reasoning performance.
Intro
Traditional adversarial robustness methods for vision-language models mostly rely on post-hoc fine-tuning of an already pre-trained model. Double Visual Defense, in contrast, applies adversarial training to both stages of the pipeline: large-scale CLIP pre-training and visual instruction tuning.
Method
Adversarial Contrastive Pre-Training
∆CLIP is trained to predict the right image-text pairings given adversarial images that are optimized to fool the model into predicting incorrect image-text pairings.
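A minimal PyTorch sketch of what this stage can look like, assuming a CLIP-style model that exposes `encode_image` / `encode_text`; the PGD attack, the symmetric InfoNCE loss, and the hyperparameters (`eps`, `alpha`, `n_steps`) are illustrative choices, not the paper's exact configuration:

```python
# Sketch of adversarial contrastive pre-training: craft adversarial images
# that maximize the CLIP loss, then update the model to minimize it on them.
# `model.encode_image` / `model.encode_text` and all hyperparameters below
# are assumptions for illustration.
import torch
import torch.nn.functional as F

def clip_loss(model, images, text_tokens, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(text_tokens), dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(images), device=images.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def pgd_images(model, images, text_tokens, eps=4/255, alpha=1/255, n_steps=3):
    """Perturb images (within an l-inf ball) to push the model toward
    incorrect image-text pairings, i.e. maximize the contrastive loss."""
    adv = images + torch.empty_like(images).uniform_(-eps, eps)
    for _ in range(n_steps):
        adv = adv.detach().requires_grad_(True)
        loss = clip_loss(model, adv, text_tokens)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv + alpha * grad.sign()                  # ascend the loss
        adv = images + (adv - images).clamp(-eps, eps)   # project to eps-ball
        adv = adv.clamp(0, 1)                            # stay a valid image
    return adv.detach()

def adversarial_train_step(model, optimizer, images, text_tokens):
    """One training step: attack first, then minimize the loss on the attack."""
    adv = pgd_images(model, images, text_tokens)
    loss = clip_loss(model, adv, text_tokens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design choice is the inner-outer structure: the attacker maximizes the same contrastive objective that the outer optimizer minimizes, so robustness is baked into pre-training rather than patched on afterwards.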
Adversarial Visual Instruction-Tuning
In the second stage, the same recipe is carried into visual instruction tuning: input images are perturbed to maximize the model's loss on the ground-truth response, and the model is tuned to answer correctly on these perturbed images, as in the sketch below.
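A hedged sketch of that stage, written by analogy with the contrastive stage above; the HuggingFace-style interface `vlm(images=..., input_ids=..., labels=...).loss` and all hyperparameters are assumptions, not the paper's actual code:

```python
# Sketch of adversarial visual instruction tuning: attack the autoregressive
# answer loss, then fine-tune on the adversarial images. The `vlm(...)` call
# returning an output with `.loss` is an assumed HuggingFace-style interface.
import torch

def pgd_on_answer_loss(vlm, images, input_ids, labels,
                       eps=4/255, alpha=1/255, n_steps=3):
    """Perturb images to maximize the next-token loss on the answer tokens."""
    adv = images + torch.empty_like(images).uniform_(-eps, eps)
    for _ in range(n_steps):
        adv = adv.detach().requires_grad_(True)
        loss = vlm(images=adv, input_ids=input_ids, labels=labels).loss
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv + alpha * grad.sign()                  # ascend the loss
        adv = images + (adv - images).clamp(-eps, eps)   # project to eps-ball
        adv = adv.clamp(0, 1)
    return adv.detach()

def adversarial_instruction_step(vlm, optimizer, images, input_ids, labels):
    """Standard instruction-tuning step, but on the adversarial images."""
    adv = pgd_on_answer_loss(vlm, images, input_ids, labels)
    loss = vlm(images=adv, input_ids=input_ids, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```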
Q&A
- Is adversarial robustness the same thing as domain robustness? I've done research on test-time prompt tuning to enhance zero-shot and cross-domain generalization of vision-language models. Are they the same? If not, what's the difference between cross-domain, out-of-distribution, and adversarial robustness?
  - They are related but not the same. Cross-domain and out-of-distribution robustness concern natural distribution shifts (e.g., sketches, corruptions, or new styles at test time), whereas adversarial robustness concerns worst-case perturbations deliberately optimized, typically within a small ℓp-norm ball, to maximize the model's error. A model can generalize well across domains yet still be broken by imperceptible adversarial noise, and vice versa.