Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
This repository is the official implementation of Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction. [Paper]
A comparison of our SADCA and existing frameworks. (a) and (b) illustrate the core concepts of SGA and SA-AET, respectively, where only one or two static interactions are performed between the visual and textual modalities, and the interactions are limited to positive pairs. (c) illustrates the core idea of the proposed SADCA, which continuously disrupts cross-modal interactions through dynamic contrastive interactions with both positive and negative pairs. Additionally, it leverages a semantic augmentation strategy to enrich the data samples, thereby diversifying the semantic information. Arrows represent interactions between the visual and textual modalities; dotted lines represent the generation of adversarial examples from the original examples. (d) demonstrates the effectiveness of the input transformation in enhancing adversarial attack transferability. Furthermore, we observe that using a large number of iterations (LI) to attack the image modality can further improve attack performance.
conda create -n SADCA python=3.10
conda activate SADCA
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

Download the datasets, Flickr30k and MSCOCO (the annotations are provided in ./data_annotation/). Set the root path of the dataset via the image_root field in ./configs/Retrieval_flickr.yaml.
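For reference, the relevant entry in ./configs/Retrieval_flickr.yaml might look like the fragment below; the path shown is an illustrative placeholder, so point it at your local Flickr30k image directory:

```yaml
# configs/Retrieval_flickr.yaml (excerpt; path is a placeholder)
image_root: /path/to/flickr30k/images/
```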
The checkpoints of the fine-tuned VLP models are available in ALBEF, TCL, CLIP.
Download the datasets from this link.
Download the ALBEF and TCL pre-trained models from this link into the checkpoints folder.
Download bert-base-uncased from this link.
We provide eval_SADCA.py for Image-Text Retrieval attack evaluation.
Here is an example for the Flickr30K dataset.
python eval_SADCA.py --config ./configs/Retrieval_flickr.yaml \
--cuda_id 0 \
--source_model CLIP_CNN \
--albef_ckpt ./checkpoints/albef_flickr.pth \
--tcl_ckpt ./checkpoints/tcl_flickr.pth \
--original_rank_index_path ./std_eval_idx/flickr30k/ \
--result_file_path ./flickr30k_adv/result_SADCA.txt \
--save_advimg_path ./flickr30k_adv/SADCA_CLIP_CNN/ \
--save_advimg_caption_path ./flickr30k_adv/SADCA_CLIP_CNN.json

Here is an example for the MSCOCO dataset.
python eval_SADCA.py --config ./configs/Retrieval_coco.yaml \
--cuda_id 0 \
--source_model CLIP_CNN \
--albef_ckpt ./checkpoints/albef_mscoco.pth \
--tcl_ckpt ./checkpoints/tcl_mscoco.pth \
--original_rank_index_path ./std_eval_idx/coco/ \
--result_file_path ./mscoco_adv/result_SADCA.txt \
--save_advimg_path ./mscoco_adv/SADCA_CLIP_CNN/ \
--save_advimg_caption_path ./mscoco_adv/SADCA_CLIP_CNN.json

Main Results:
We present two cross-task attack evaluations, ITR->VG and ITR->IC.
ITR->VG:
First, please use the MSCOCO dataset and the provided files ./data_annotation/refcoco+_test_for_adv.json and ./data_annotation/refcoco+_val_for_adv.json to generate the adversarial images (3K images).
After that, please refer to Grounding.py (use '--evaluate') in ALBEF, and replace the clean images in the MSCOCO dataset with the adversarial images. You will then obtain the performance of the ALBEF model on the adversarial images, reported as the Val, TestA, and TestB metrics.
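The "replace the clean images with the adversarial images" step can be sketched as follows. This is a minimal helper, not part of the repository: the function name and the assumption that adversarial images share filenames with their clean counterparts are ours; adjust the glob pattern and paths to your dataset layout.

```python
import shutil
from pathlib import Path

def overlay_adv_images(clean_root: str, adv_root: str) -> int:
    """Copy adversarial images over their clean counterparts, matched by
    filename. Returns the number of images replaced. Illustrative sketch;
    back up the clean dataset before overwriting it."""
    clean_dir, adv_dir = Path(clean_root), Path(adv_root)
    replaced = 0
    for adv_img in adv_dir.glob("*.jpg"):
        target = clean_dir / adv_img.name
        if target.exists():  # only overwrite images that exist in the clean set
            shutil.copy2(adv_img, target)
            replaced += 1
    return replaced
```

After overlaying, running the upstream evaluation script on the (now adversarial) image root yields the attacked metrics.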
ITR->IC:
First, please use the MSCOCO dataset and the provided files ./data_annotation/coco_karpathy_test.json and ./data_annotation/coco_karpathy_val.json to generate the adversarial images (10K images).
After that, please refer to train_caption.py (use '--evaluate') in BLIP, and replace the clean images in the MSCOCO dataset with the adversarial images. You will then obtain the performance of the BLIP model on the adversarial images, reported as the B@4, METEOR, ROUGE-L, CIDEr, and SPICE metrics.
Main Results:
Employ the binary decision template "Does the picture depict that 'adversarial text'? Only answer Yes or No." to construct adversarial text prompts. Then feed the adversarial text prompts together with the adversarial images to the LVLMs.
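Constructing the prompt from the template above is a one-liner; the helper name below is ours, not from the repository:

```python
def build_binary_prompt(adv_text: str) -> str:
    # Fill the binary decision template with an adversarial caption.
    return f"Does the picture depict that '{adv_text}'? Only answer Yes or No."

# Example: pair this prompt with the corresponding adversarial image
# when querying an LVLM.
prompt = build_binary_prompt("a dog riding a bicycle")
```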
Main Results:
This project is built on SA-AET. We sincerely thank them for their outstanding work.
@article{li2026towards,
title={Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction},
author={Li, Yuanbo and Xu, Tianyang and Hu, Cong and Zhou, Tao and Wu, Xiao-Jun and Kittler, Josef},
journal={arXiv preprint arXiv:2603.04839},
year={2026}
}