Adversarial Data Augmentation for Task-Specific Knowledge Distillation of Pre-trained Transformers

Cited by: 0
Authors
Zhang, Minjia [1]
Naresh, Niranjan Uma [1]
He, Yuxiong [1]
Affiliations
[1] Microsoft Corp, Bellevue, WA 98004 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Deep and large pre-trained language models (e.g., BERT, GPT-3) are state-of-the-art for various natural language processing tasks. However, the huge size of these models brings challenges to fine-tuning and online deployment due to latency and cost constraints. Existing knowledge distillation methods reduce the model size, but they may encounter difficulties transferring knowledge from the teacher model to the student model due to the limited data available from the downstream tasks. In this work, we propose AD², a novel and effective data augmentation approach for improving task-specific knowledge transfer when compressing large pre-trained transformer models. Unlike prior methods, AD² performs distillation using an enhanced training set that contains both the original inputs and adversarially perturbed samples that mimic the output distribution of the teacher. Experimental results show that this method allows better transfer of knowledge from the teacher to the student during distillation, producing student models that retain 99.6% of the teacher model's accuracy while outperforming existing task-specific knowledge distillation baselines by 1.2 points on average over a variety of natural language understanding tasks. Moreover, compared with alternative data augmentation methods, such as text-editing-based approaches, AD² is up to 28 times faster while achieving comparable or higher accuracy. In addition, when AD² is combined with more advanced task-agnostic distillation, we can further advance the state-of-the-art performance. Beyond the encouraging performance, this paper also provides thorough ablation studies and analysis. The discovered interplay between KD and adversarial data augmentation for compressing pre-trained transformers may further inspire more advanced KD algorithms for compressing even larger-scale models.
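The abstract only sketches the method at a high level. As a rough illustration of the general idea (not the paper's actual algorithm), the hedged PyTorch sketch below combines a standard soft-label distillation loss with an FGSM-style perturbation of the input embeddings, so the student is also trained to match the teacher's output distribution on adversarially perturbed samples. The function names, hyperparameters (temperature T, epsilon, alpha), the sign-gradient perturbation, and the toy linear stand-in models are all assumptions made for illustration.

```python
# Minimal sketch of task-specific knowledge distillation with adversarial
# data augmentation, in the spirit of the abstract above. The perturbation
# scheme and all hyperparameters are illustrative assumptions, not the
# paper's exact recipe.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation loss: KL divergence at temperature T."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)


def adversarial_kd_step(student, teacher, embeds, labels, epsilon=1e-2, alpha=0.5):
    """One training step: distill on the original inputs plus an adversarially
    perturbed copy on which the student must still mimic the teacher.

    `student` and `teacher` are callables mapping input embeddings to logits
    (e.g., transformer encoders invoked with inputs_embeds)."""
    with torch.no_grad():
        teacher_logits = teacher(embeds)

    # 1) Standard KD loss plus the supervised task loss on the original inputs.
    embeds = embeds.detach().requires_grad_(True)
    student_logits = student(embeds)
    clean_loss = kd_loss(student_logits, teacher_logits) + F.cross_entropy(
        student_logits, labels
    )

    # 2) FGSM-style augmentation: one sign-gradient ascent step on the input
    #    embeddings, in the direction that increases the training loss.
    grad = torch.autograd.grad(clean_loss, embeds, retain_graph=True)[0]
    adv_embeds = (embeds + epsilon * grad.sign()).detach()

    # 3) The student should match the teacher's output distribution on the
    #    perturbed samples as well.
    adv_loss = kd_loss(student(adv_embeds), teacher_logits)
    return clean_loss + alpha * adv_loss


# Toy usage with linear heads standing in for transformer encoders.
if __name__ == "__main__":
    torch.manual_seed(0)
    teacher_head = torch.nn.Linear(16, 3)   # hypothetical "teacher"
    student_head = torch.nn.Linear(16, 3)   # hypothetical "student"
    optimizer = torch.optim.AdamW(student_head.parameters(), lr=1e-3)

    embeds = torch.randn(8, 16)              # batch of pooled input embeddings
    labels = torch.randint(0, 3, (8,))       # downstream task labels

    loss = adversarial_kd_step(student_head, teacher_head, embeds, labels)
    loss.backward()
    optimizer.step()
    print(f"combined KD loss: {loss.item():.4f}")
```

In a real setup the two callables would wrap the fine-tuned teacher and the smaller student transformer (called with inputs_embeds so the perturbation can be applied in embedding space), and the step would be run over the downstream task's training set.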
Pages: 11685-11693
Page count: 9
Related Papers
50 in total
  • [1] Are Pre-trained Convolutions Better than Pre-trained Transformers?
    Tay, Yi
    Dehghani, Mostafa
    Gupta, Jai
    Aribandi, Vamsi
    Bahri, Dara
    Qin, Zhen
    Metzler, Donald
    [J]. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1, 2021, : 4349 - 4359
  • [2] Dynamic Knowledge Distillation for Pre-trained Language Models
    Li, Lei
    Lin, Yankai
    Ren, Shuhuai
    Li, Peng
    Zhou, Jie
    Sun, Xu
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 379 - 389
  • [3] MINILM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
    Wang, Wenhui
    Wei, Furu
    Dong, Li
    Bao, Hangbo
    Yang, Nan
    Zhou, Ming
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [4] Calibration of Pre-trained Transformers
    Desai, Shrey
    Durrett, Greg
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 295 - 302
  • [5] Pre-trained Adversarial Perturbations
    Ban, Yuanhao
    Dong, Yinpeng
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [6] AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation
    Zhou, Qinhong
    Li, Peng
    Liu, Yang
    Guan, Yuyang
    Xing, Qizhou
    Chen, Ming
    Sun, Maosong
    Liu, Yang
    [J]. AI OPEN, 2023, 4 : 56 - 63
  • [7] EAPT: An encrypted traffic classification model via adversarial pre-trained transformers
    Zhan, Mingming
    Yang, Jin
    Jia, Dongqing
    Fu, Geyuan
[J]. COMPUTER NETWORKS, 2025, 257
  • [8] Pre-trained transformers: an empirical comparison
    Casola, Silvia
    Lauriola, Ivano
    Lavelli, Alberto
    [J]. MACHINE LEARNING WITH APPLICATIONS, 2022, 9
  • [9] Ship Classification in SAR Imagery by Shallow CNN Pre-Trained on Task-Specific Dataset with Feature Refinement
    Lang, Haitao
    Wang, Ruifu
    Zheng, Shaoying
    Wu, Siwen
    Li, Jialu
    [J]. REMOTE SENSING, 2022, 14 (23)
  • [10] SHUFFLECOUNT: TASK-SPECIFIC KNOWLEDGE DISTILLATION FOR CROWD COUNTING
    Jiang, Minyang
    Lin, Jianzhe
    Wang, Z. Jane
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 999 - 1003