Rethinking Textual Adversarial Defense for Pre-Trained Language Models

Cited by: 7
Authors
Wang, Jiayi [1 ,2 ,3 ]
Bao, Rongzhou [4 ]
Zhang, Zhuosheng [1 ,2 ,3 ]
Zhao, Hai [1 ,2 ,3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligent Interac, Shanghai 200240, Peoples R China
[3] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[4] Ant Grp, Hangzhou 310000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Detectors; Perturbation methods; Robustness; Speech processing; Adaptation models; Predictive models; Computer vision; Adversarial attack; adversarial defense; pre-trained language models; attacks
DOI
10.1109/TASLP.2022.3192097
CLC Classification Number
O42 [Acoustics]
Subject Classification Numbers
070206; 082403
Abstract
Although pre-trained language models (PrLMs) have achieved significant success, recent studies show that they are vulnerable to adversarial attacks. By introducing slight perturbations at the sentence, word, or character level, adversarial attacks can fool PrLMs into producing incorrect predictions, calling the robustness of PrLMs into question. However, we find that most existing textual adversarial examples are unnatural and can be easily distinguished by both humans and machines. Building on a general anomaly detector, we propose a novel metric, the Degree of Anomaly, as a constraint that requires current adversarial attack approaches to generate more natural and imperceptible adversarial examples. Under this new constraint, the success rate of existing attacks drops drastically, which reveals that PrLMs are not as fragile as these attacks suggest. In addition, we find that four types of randomization can invalidate a large portion of textual adversarial examples. Based on the anomaly detector and randomization, we design a universal defense framework, which is among the first to perform textual adversarial defense without knowledge of the specific attack. Empirical results show that our universal defense framework achieves after-attack accuracy comparable to, or even higher than, attack-specific defenses, while preserving higher accuracy on clean inputs. Our work discloses the essence of textual adversarial attacks and indicates that (i) future work on adversarial attacks should focus on evading detection and resisting randomization, since the resulting adversarial examples would otherwise be easily detected and invalidated; and (ii) compared with unnatural, perceptible adversarial examples, it is the undetectable ones that pose real risks for PrLMs and deserve more attention in future robustness-enhancing strategies.
Pages: 2526-2540
Page count: 15
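
The abstract compresses two mechanisms: an anomaly detector whose score (the Degree of Anomaly) can both flag suspicious inputs and constrain an attack, and input randomization that invalidates adversarial perturbations. The sketch below shows one way such a pipeline could fit together. It is a minimal illustration, not the paper's implementation: the heuristic `degree_of_anomaly` scorer, the stub `victim_classifier`, the mask-based randomization, the routing threshold `tau`, and `filter_attack_candidates` are all assumptions introduced here for concreteness.

```python
# Minimal sketch of the two defense ingredients described in the abstract:
# (1) an anomaly detector scoring a "Degree of Anomaly", and
# (2) input randomization with majority voting.
# All components are illustrative stand-ins (see the note above).
import random
from collections import Counter

MASK = "[MASK]"

def degree_of_anomaly(text: str) -> float:
    """Stub anomaly scorer. A real detector would be a binary classifier
    (e.g., a fine-tuned PrLM) returning P(adversarial | text); this
    hypothetical heuristic merely counts odd-looking tokens, such as
    mid-word case changes typical of character-level perturbations."""
    tokens = text.split()
    odd = sum(
        1 for t in tokens
        if not t.isascii()
        or (t[1:] != t[1:].lower() and t[1:] != t[1:].upper())
    )
    return odd / max(len(tokens), 1)

def victim_classifier(text: str) -> int:
    """Stub victim model; a real system would call a PrLM classifier."""
    return int("good" in text.lower())

def randomized_predict(text: str, mask_rate: float = 0.15, votes: int = 5) -> int:
    """Randomization defense: vote over several randomly masked copies of
    the input, so a brittle adversarial perturbation is unlikely to
    survive every vote. Random masking is one plausible instantiation of
    the randomizations the abstract mentions."""
    preds = []
    for _ in range(votes):
        masked = [MASK if random.random() < mask_rate else t for t in text.split()]
        preds.append(victim_classifier(" ".join(masked)))
    return Counter(preds).most_common(1)[0][0]

def universal_defense(text: str, tau: float = 0.3) -> int:
    """Detector-first routing (an assumption about the pipeline): inputs
    whose Degree of Anomaly exceeds tau go through the randomized
    predictor, while clean-looking inputs go straight to the victim
    model, which helps preserve accuracy on unattacked data."""
    if degree_of_anomaly(text) > tau:
        return randomized_predict(text)
    return victim_classifier(text)

def filter_attack_candidates(candidates: list[str], tau: float = 0.3) -> list[str]:
    """Attack-side use of the same metric: discard candidate adversarial
    examples that look anomalous, keeping only natural-looking ones."""
    return [c for c in candidates if degree_of_anomaly(c) <= tau]

if __name__ == "__main__":
    print(universal_defense("This movie is good and heartwarming."))  # clean input
    print(universal_defense("This mOvie is gOod and heartwArming."))  # perturbed input
```

In the paper both components are PrLM-based models; the stubs here only make the control flow concrete and runnable, and show why a detector-plus-randomization pipeline can defend without knowing which attack produced the input.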