Adversarial Data Augmentation for Task-Specific Knowledge Distillation of Pre-trained Transformers

Cited by: 0
Authors
Zhang, Minjia [1]
Naresh, Niranjan Uma [1]
He, Yuxiong [1]
Affiliations
[1] Microsoft Corp, Bellevue, WA 98004 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for various natural language processing tasks. However, the huge size of these models brings challenges to fine-tuning and online deployment due to latency and cost constraints. Existing knowledge distillation methods reduce the model size, but they may encounter difficulties transferring knowledge from the teacher model to the student model due to the limited data available from the downstream tasks. In this work, we propose AD², a novel and effective data augmentation approach for improving task-specific knowledge transfer when compressing large pre-trained transformer models. Unlike prior methods, AD² performs distillation using an enhanced training set that contains both the original inputs and adversarially perturbed samples that mimic the output distribution of the teacher. Experimental results show that this method allows better transfer of knowledge from the teacher to the student during distillation, producing student models that retain 99.6% of the teacher model's accuracy while outperforming existing task-specific knowledge distillation baselines by 1.2 points on average over a variety of natural language understanding tasks. Moreover, compared with alternative data augmentation methods, such as text-editing-based approaches, AD² is up to 28 times faster while achieving comparable or higher accuracy. In addition, when AD² is combined with more advanced task-agnostic distillation, we can further advance the state-of-the-art performance. Beyond the encouraging performance, this paper also provides thorough ablation studies and analysis. The discovered interplay between KD and adversarial data augmentation for compressing pre-trained transformers may further inspire more advanced KD algorithms for compressing even larger-scale models.
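The abstract only sketches the method at a high level. As a rough illustration of the general idea (not the paper's actual algorithm), the hedged PyTorch sketch below combines a standard soft-label distillation loss with an FGSM-style perturbation of the input embeddings, so the student is also trained to match the teacher's output distribution on adversarially perturbed samples. The function names, hyperparameters (temperature T, epsilon, alpha), the sign-gradient perturbation, and the toy linear stand-in models are all assumptions made for illustration.

```python
# Minimal sketch of task-specific knowledge distillation with adversarial
# data augmentation, in the spirit of the abstract above. The perturbation
# scheme and all hyperparameters are illustrative assumptions, not the
# paper's exact recipe.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation loss: KL divergence at temperature T."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)


def adversarial_kd_step(student, teacher, embeds, labels, epsilon=1e-2, alpha=0.5):
    """One training step: distill on the original inputs plus an adversarially
    perturbed copy on which the student must still mimic the teacher.

    `student` and `teacher` are callables mapping input embeddings to logits
    (e.g., transformer encoders invoked with inputs_embeds)."""
    with torch.no_grad():
        teacher_logits = teacher(embeds)

    # 1) Standard KD loss plus the supervised task loss on the original inputs.
    embeds = embeds.detach().requires_grad_(True)
    student_logits = student(embeds)
    clean_loss = kd_loss(student_logits, teacher_logits) + F.cross_entropy(
        student_logits, labels
    )

    # 2) FGSM-style augmentation: one sign-gradient ascent step on the input
    #    embeddings, in the direction that increases the training loss.
    grad = torch.autograd.grad(clean_loss, embeds, retain_graph=True)[0]
    adv_embeds = (embeds + epsilon * grad.sign()).detach()

    # 3) The student should match the teacher's output distribution on the
    #    perturbed samples as well.
    adv_loss = kd_loss(student(adv_embeds), teacher_logits)
    return clean_loss + alpha * adv_loss


# Toy usage with linear heads standing in for transformer encoders.
if __name__ == "__main__":
    torch.manual_seed(0)
    teacher_head = torch.nn.Linear(16, 3)   # hypothetical "teacher"
    student_head = torch.nn.Linear(16, 3)   # hypothetical "student"
    optimizer = torch.optim.AdamW(student_head.parameters(), lr=1e-3)

    embeds = torch.randn(8, 16)              # batch of pooled input embeddings
    labels = torch.randint(0, 3, (8,))       # downstream task labels

    loss = adversarial_kd_step(student_head, teacher_head, embeds, labels)
    loss.backward()
    optimizer.step()
    print(f"combined KD loss: {loss.item():.4f}")
```

In a real setup the two callables would wrap the fine-tuned teacher and the smaller student transformer (called with inputs_embeds so the perturbation can be applied in embedding space), and the step would be run over the downstream task's training set.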
Pages: 11685-11693
Page count: 9
Related Papers
50 in total
  • [1] Are Pre-trained Convolutions Better than Pre-trained Transformers?
    Tay, Yi
    Dehghani, Mostafa
    Gupta, Jai
    Aribandi, Vamsi
    Bahri, Dara
    Qin, Zhen
    Metzler, Donald
    [J]. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1, 2021, : 4349 - 4359
  • [2] Dynamic Knowledge Distillation for Pre-trained Language Models
    Li, Lei
    Lin, Yankai
    Ren, Shuhuai
    Li, Peng
    Zhou, Jie
    Sun, Xu
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 379 - 389
  • [3] MINILM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
    Wang, Wenhui
    Wei, Furu
    Dong, Li
    Bao, Hangbo
    Yang, Nan
    Zhou, Ming
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [4] Calibration of Pre-trained Transformers
    Desai, Shrey
    Durrett, Greg
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 295 - 302
  • [5] Pre-trained Adversarial Perturbations
    Ban, Yuanhao
    Dong, Yinpeng
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [6] AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation
    Zhou, Qinhong
    Li, Peng
    Liu, Yang
    Guan, Yuyang
    Xing, Qizhou
    Chen, Ming
    Sun, Maosong
    Liu, Yang
    [J]. AI OPEN, 2023, 4 : 56 - 63
  • [7] EAPT: An encrypted traffic classification model via adversarial pre-trained transformers
    Zhan, Mingming
    Yang, Jin
    Jia, Dongqing
    Fu, Geyuan
[J]. COMPUTER NETWORKS, 2025, 257
  • [8] Pre-trained transformers: an empirical comparison
    Casola, Silvia
    Lauriola, Ivano
    Lavelli, Alberto
    [J]. MACHINE LEARNING WITH APPLICATIONS, 2022, 9
  • [9] Ship Classification in SAR Imagery by Shallow CNN Pre-Trained on Task-Specific Dataset with Feature Refinement
    Lang, Haitao
    Wang, Ruifu
    Zheng, Shaoying
    Wu, Siwen
    Li, Jialu
    [J]. REMOTE SENSING, 2022, 14 (23)
  • [10] SHUFFLECOUNT: TASK-SPECIFIC KNOWLEDGE DISTILLATION FOR CROWD COUNTING
    Jiang, Minyang
    Lin, Jianzhe
    Wang, Z. Jane
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 999 - 1003