Don't Retrain, Just Rewrite: Countering Adversarial Perturbations by Rewriting Text

Cited by: 0
Authors
Gupta, Ashim [1]
Blum, Carter Wood [2]
Choji, Temma [2]
Fei, Yingjie [2]
Shah, Shalin [2]
Vempala, Alakananda [2]
Srikumar, Vivek [1]
Affiliations
[1] Univ Utah, Salt Lake City, UT 84112 USA
[2] Bloomberg, New York, NY USA
Funding
U.S. National Science Foundation
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Can language models transform inputs to protect text classifiers against adversarial attacks? In this work, we present ATINTER, a model that intercepts and learns to rewrite adversarial inputs to make them non-adversarial for a downstream text classifier. Our experiments on four datasets and five attack mechanisms reveal that ATINTER is effective at providing better adversarial robustness than existing defense approaches, without compromising task accuracy. For example, on sentiment classification using the SST-2 dataset, our method improves the adversarial accuracy over the best existing defense approach by more than 4% with a smaller decrease in task accuracy (0.5% vs. 2.5%). Moreover, we show that ATINTER generalizes across multiple downstream tasks and classifiers without having to explicitly retrain it for those settings. For example, we find that when ATINTER is trained to remove adversarial perturbations for the sentiment classification task on the SST-2 dataset, it even transfers to a semantically different task of news classification (on AGNews) and improves the adversarial robustness by more than 10%.
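The interception idea in the abstract (rewrite the input first, then hand it to an unchanged classifier) can be illustrated with a minimal sketch. This is not the ATINTER model: the abstract describes a learned rewriter, while `toy_rewriter`, `toy_classifier`, `TYPO_FIXES`, and `make_defended_classifier` below are hypothetical stand-ins chosen only to show how the two components compose.

```python
from typing import Callable

# Both components are modeled as plain functions (an assumption for illustration).
Rewriter = Callable[[str], str]      # maps possibly adversarial text to cleaned text
Classifier = Callable[[str], str]    # maps text to a label


def make_defended_classifier(rewriter: Rewriter, classifier: Classifier) -> Classifier:
    """Wrap a classifier so every input is rewritten before classification.

    The classifier itself is not modified or retrained; only its input changes.
    """
    def defended(text: str) -> str:
        return classifier(rewriter(text))
    return defended


# Hypothetical stand-in for a learned rewriter: a tiny lookup table that
# undoes a few character-level perturbations.
TYPO_FIXES = {"terrlble": "terrible", "g00d": "good"}


def toy_rewriter(text: str) -> str:
    return " ".join(TYPO_FIXES.get(token, token) for token in text.split())


# Hypothetical stand-in for a downstream sentiment classifier: keyword matching.
NEGATIVE_WORDS = {"terrible", "bad"}


def toy_classifier(text: str) -> str:
    tokens = text.lower().split()
    return "negative" if any(t in NEGATIVE_WORDS for t in tokens) else "positive"


defended = make_defended_classifier(toy_rewriter, toy_classifier)

if __name__ == "__main__":
    perturbed = "a terrlble movie"
    print(toy_classifier(perturbed))  # undefended prediction on the perturbed input
    print(defended(perturbed))        # prediction after the input is rewritten
```

Because the wrapper only touches the input, the same rewriter can in principle be placed in front of a different classifier without changing either component, which is the kind of reuse the abstract reports across tasks.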
Pages: 13981-13998
Page count: 18