VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Cited by: 0
Authors
Yin, Ziyi [1 ]
Ye, Muchao [1 ]
Zhang, Tianrong [1 ]
Du, Tianyu [2 ]
Zhu, Jinguo [3 ]
Liu, Han [4 ]
Chen, Jinghui [1 ]
Wang, Ting [5 ]
Ma, Fenglong [1 ]
Affiliations
[1] Penn State Univ, University Pk, PA 16802 USA
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Xi An Jiao Tong Univ, Xian, Peoples R China
[4] Dalian Univ Technol, Dalian, Peoples R China
[5] SUNY Stony Brook, Stony Brook, NY USA
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we aim to investigate a new yet practical task to craft image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. Towards this end, we propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy to learn image perturbations for disrupting universal representations. Besides, we adopt an existing text attack strategy to generate text perturbations independent of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method to update adversarial image-text pairs periodically, starting with the outputs from the single-modal level. We conduct extensive experiments to attack five widely-used VL pre-trained models for six tasks. Experimental results show that VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, which reveals a blind spot in the deployment of pre-trained VL models.
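To make the single-modal image step of the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a block-wise-similarity-style attack: a PGD loop that perturbs the image so that each block's intermediate features from a surrogate encoder drift away from the clean features, under an L-infinity budget. The ToyEncoder, the function blockwise_similarity_attack, and all hyperparameters are illustrative stand-ins under our own assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a block-wise similarity attack on image inputs,
# in the spirit of the BSA idea described in the abstract. Everything here
# (ToyEncoder, hyperparameters) is illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a pre-trained image encoder that exposes per-block features."""

    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x.flatten(1))  # one feature vector per block
        return feats


def blockwise_similarity_attack(encoder, image, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style loop pushing every block's features away from the clean ones."""
    with torch.no_grad():
        clean_feats = [f.detach() for f in encoder(image)]

    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        adv_feats = encoder(adv)
        # Minimize cosine similarity to the clean block features (i.e. maximize
        # dissimilarity), summed over all blocks.
        loss = sum(F.cosine_similarity(a, c, dim=1).mean()
                   for a, c in zip(adv_feats, clean_feats))
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()               # descend on similarity
            adv = image + (adv - image).clamp(-eps, eps)  # project to L-inf ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyEncoder().eval()
    x = torch.rand(1, 3, 64, 64)
    x_adv = blockwise_similarity_attack(model, x)
    print("max perturbation:", (x_adv - x).abs().max().item())
```

In the full pipeline the abstract describes, such an image perturbation would be paired with an independently generated text perturbation and, if neither succeeds alone, refined jointly by the iterative cross-search attack; this sketch covers only the image-side intuition.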
Pages: 21
Related Papers
50 records in total
  • [1] Universal Adversarial Perturbations for Vision-Language Pre-trained Models
    Zhang, Peng-Fei
    Huang, Zi
    Bai, Guangdong
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 862 - 871
  • [2] Multimodal Search on Iconclass using Vision-Language Pre-Trained Models
    Santini, Cristian
    Posthumus, Etienne
    Tietz, Tabea
    Tan, Mary Ann
    Bruns, Oleksandra
    Sack, Harald
    2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023, : 285 - 287
  • [3] CPT: Colorful Prompt Tuning for pre-trained vision-language models
    Yao, Yuan
    Zhang, Ao
    Zhang, Zhengyan
    Liu, Zhiyuan
    Chua, Tat-Seng
    Sun, Maosong
    AI OPEN, 2024, 5 : 30 - 38
  • [4] p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
    Wu, Haoyuan
    Zhang, Xinyun
    Xu, Peng
    Liao, Peiyu
    Yao, Xufeng
    Yu, Bei
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 6003 - 6011
  • [5] Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
    Wu, Qiong
    Yu, Wei
    Zhou, Yiyi
    Huang, Shubin
    Sun, Xiaoshuai
    Ji, Rongrong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [6] Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors
    Kawaharazuka, Kento
    Obinata, Yoshiki
    Kanazawa, Naoaki
    Okada, Kei
    Inaba, Masayuki
    2023 IEEE-RAS 22ND INTERNATIONAL CONFERENCE ON HUMANOID ROBOTS, HUMANOIDS, 2023,
  • [7] Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis
    An, Jieyu
    Zainon, Wan Mohd Nazmee Wan
    Ding, Binfen
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 37 (02): : 1673 - 1689
  • [8] Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models
    Zheng, Kecheng
    Wu, Wei
    Feng, Ruili
    Zhu, Kai
    Liu, Jiawei
    Zhao, Deli
    Zha, Zheng-Jun
    Chen, Wei
    Shen, Yujun
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11629 - 11639
  • [9] Open-World Object Manipulation using Pre-Trained Vision-Language Models
    Stone, Austin
    Xiao, Ted
    Lu, Yao
    Gopalakrishnan, Keerthana
    Lee, Kuang-Huei
    Quan Vuong
    Wohlhart, Paul
    Kirmani, Sean
    Zitkovich, Brianna
    Xia, Fei
    Finn, Chelsea
    Hausman, Karol
    CONFERENCE ON ROBOT LEARNING, VOL 229, 2023, 229
  • [10] Harnessing the Power of Pre-trained Vision-Language Models for Efficient Medical Report Generation
    Li, Qi
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 1308 - 1317