VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Cited by: 0
Authors
Yin, Ziyi [1 ]
Ye, Muchao [1 ]
Zhang, Tianrong [1 ]
Du, Tianyu [2 ]
Zhu, Jinguo [3 ]
Liu, Han [4 ]
Chen, Jinghui [1 ]
Wang, Ting [5 ]
Ma, Fenglong [1 ]
Affiliations
[1] Penn State Univ, University Pk, PA 16802 USA
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Xi An Jiao Tong Univ, Xian, Peoples R China
[4] Dalian Univ Technol, Dalian, Peoples R China
[5] SUNY Stony Brook, Stony Brook, NY USA
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we aim to investigate a new yet practical task to craft image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. Towards this end, we propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy to learn image perturbations for disrupting universal representations. Besides, we adopt an existing text attack strategy to generate text perturbations independent of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method to update adversarial image-text pairs periodically, starting with the outputs from the single-modal level. We conduct extensive experiments to attack five widely-used VL pre-trained models for six tasks. Experimental results show that VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, which reveals a blind spot in the deployment of pre-trained VL models.
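To make the single-modal image step of the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a block-wise-similarity-style attack: a PGD loop that perturbs the image so that each block's intermediate features from a surrogate encoder drift away from the clean features, under an L-infinity budget. The ToyEncoder, the function blockwise_similarity_attack, and all hyperparameters are illustrative stand-ins under our own assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a block-wise similarity attack on image inputs,
# in the spirit of the BSA idea described in the abstract. Everything here
# (ToyEncoder, hyperparameters) is illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a pre-trained image encoder that exposes per-block features."""

    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x.flatten(1))  # one feature vector per block
        return feats


def blockwise_similarity_attack(encoder, image, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style loop pushing every block's features away from the clean ones."""
    with torch.no_grad():
        clean_feats = [f.detach() for f in encoder(image)]

    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        adv_feats = encoder(adv)
        # Minimize cosine similarity to the clean block features (i.e. maximize
        # dissimilarity), summed over all blocks.
        loss = sum(F.cosine_similarity(a, c, dim=1).mean()
                   for a, c in zip(adv_feats, clean_feats))
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()               # descend on similarity
            adv = image + (adv - image).clamp(-eps, eps)  # project to L-inf ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyEncoder().eval()
    x = torch.rand(1, 3, 64, 64)
    x_adv = blockwise_similarity_attack(model, x)
    print("max perturbation:", (x_adv - x).abs().max().item())
```

In the full pipeline the abstract describes, such an image perturbation would be paired with an independently generated text perturbation and, if neither succeeds alone, refined jointly by the iterative cross-search attack; this sketch covers only the image-side intuition.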
Pages: 21
Related Papers
50 records in total
  • [1] Universal Adversarial Perturbations for Vision-Language Pre-trained Models
    Zhang, Peng-Fei
    Huang, Zi
    Bai, Guangdong
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 862 - 871
  • [2] Multimodal Search on Iconclass using Vision-Language Pre-Trained Models
    Santini, Cristian
    Posthumus, Etienne
    Tietz, Tabea
    Tan, Mary Ann
    Bruns, Oleksandra
    Sack, Harald
    2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023, : 285 - 287
  • [3] CPT: Colorful Prompt Tuning for pre-trained vision-language models
    Yao, Yuan
    Zhang, Ao
    Zhang, Zhengyan
    Liu, Zhiyuan
    Chua, Tat-Seng
    Sun, Maosong
    AI OPEN, 2024, 5 : 30 - 38
  • [4] p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
    Wu, Haoyuan
    Zhang, Xinyun
    Xu, Peng
    Liao, Peiyu
    Yao, Xufeng
    Yu, Bei
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 6003 - 6011
  • [5] Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
    Wu, Qiong
    Yu, Wei
    Zhou, Yiyi
    Huang, Shubin
    Sun, Xiaoshuai
    Ji, Rongrong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [6] Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors
    Kawaharazuka, Kento
    Obinata, Yoshiki
    Kanazawa, Naoaki
    Okada, Kei
    Inaba, Masayuki
    2023 IEEE-RAS 22ND INTERNATIONAL CONFERENCE ON HUMANOID ROBOTS, HUMANOIDS, 2023,
  • [7] Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis
    An, Jieyu
    Zainon, Wan Mohd Nazmee Wan
    Ding, Binfen
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 37 (02): : 1673 - 1689
  • [8] Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models
    Zheng, Kecheng
    Wu, Wei
    Feng, Ruili
    Zhu, Kai
    Liu, Jiawei
    Zhao, Deli
    Zha, Zheng-Jun
    Chen, Wei
    Shen, Yujun
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11629 - 11639
  • [9] Open-World Object Manipulation using Pre-Trained Vision-Language Models
    Stone, Austin
    Xiao, Ted
    Lu, Yao
    Gopalakrishnan, Keerthana
    Lee, Kuang-Huei
    Quan Vuong
    Wohlhart, Paul
    Kirmani, Sean
    Zitkovich, Brianna
    Xia, Fei
    Finn, Chelsea
    Hausman, Karol
    CONFERENCE ON ROBOT LEARNING, VOL 229, 2023, 229
  • [10] Harnessing the Power of Pre-trained Vision-Language Models for Efficient Medical Report Generation
    Li, Qi
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 1308 - 1317