Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation

Cited by: 1
Authors
Chen, Cheng [1 ]
Yin, Yichun [2 ]
Shang, Lifeng [2 ]
Wang, Zhi [3 ,4 ]
Jiang, Xin [2 ]
Chen, Xiao [2 ]
Liu, Qun [2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Huawei Noah's Ark Lab, Shenzhen, Peoples R China
[3] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[4] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
BERT; Knowledge distillation; Structured pruning;
DOI
10.1007/978-3-030-86365-4_46
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Task-agnostic knowledge distillation, a teacher-student framework, has proven effective for BERT compression. Although it achieves promising results on NLP tasks, it requires enormous computational resources. In this paper, we propose Extract Then Distill (ETD), a generic and flexible strategy that reuses the teacher's parameters for efficient and effective task-agnostic distillation and can be applied to students of any size. Specifically, we introduce two variants of ETD, ETD-Rand and ETD-Impt, which extract the teacher's parameters at random and according to an importance metric, respectively. In this way, the student has already acquired some knowledge at the start of distillation, which makes the distillation process converge faster. We demonstrate the effectiveness of ETD on the GLUE benchmark and SQuAD. The experimental results show that: (1) compared with the baseline without an ETD strategy, ETD saves 70% of the computation cost; moreover, it achieves better results than the baseline under the same compute budget. (2) ETD is generic and proves effective for different distillation methods (e.g., TinyBERT and MiniLM) and for students of different sizes. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model.
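To make the extraction step concrete, below is a minimal PyTorch sketch of width extraction: a teacher weight matrix is sliced down to the student's width, keeping output neurons either uniformly at random (as in ETD-Rand) or by an importance score. The function names are illustrative, and the L1 row-norm score is a stand-in assumption for the paper's actual importance metric in ETD-Impt.

```python
import torch

def select_neurons(weight: torch.Tensor, n_keep: int, mode: str = "impt") -> torch.Tensor:
    """Choose which teacher output neurons the student keeps:
    uniformly at random (ETD-Rand) or by a score (stand-in for ETD-Impt)."""
    if mode == "rand":
        return torch.randperm(weight.size(0))[:n_keep]
    scores = weight.abs().sum(dim=1)     # L1 norm of each output row (assumed proxy metric)
    return scores.topk(n_keep).indices

def extract_linear(weight, bias, keep_out, keep_in=None):
    """Slice an (out_dim x in_dim) teacher linear layer down to student width."""
    w = weight[keep_out]
    if keep_in is not None:
        w = w[:, keep_in]                # also narrow the input side if the previous layer shrank
    return w.clone(), bias[keep_out].clone()

# Toy example: halve the intermediate width of a BERT-base FFN layer (3072 -> 1536).
teacher_w, teacher_b = torch.randn(3072, 768), torch.randn(3072)
keep = select_neurons(teacher_w, n_keep=1536, mode="impt")
student_w, student_b = extract_linear(teacher_w, teacher_b, keep)
print(student_w.shape, student_b.shape)  # torch.Size([1536, 768]) torch.Size([1536])
```

The extracted weights would then initialize the student before standard task-agnostic distillation begins, which is what lets training converge faster than from a random initialization.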
Pages: 570-581
Page count: 12