Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation

Cited by: 1
Authors
Chen, Cheng [1 ]
Yin, Yichun [2 ]
Shang, Lifeng [2 ]
Wang, Zhi [3 ,4 ]
Jiang, Xin [2 ]
Chen, Xiao [2 ]
Liu, Qun [2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Huawei Noah's Ark Lab, Shenzhen, Peoples R China
[3] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[4] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
BERT; Knowledge distillation; Structured pruning;
DOI
10.1007/978-3-030-86365-4_46
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Task-agnostic knowledge distillation, a teacher-student framework, has proven effective for BERT compression. Although it achieves promising results on NLP tasks, it requires enormous computational resources. In this paper, we propose Extract Then Distill (ETD), a generic and flexible strategy that reuses the teacher's parameters for efficient and effective task-agnostic distillation and can be applied to students of any size. Specifically, we introduce two variants of ETD, ETD-Rand and ETD-Impt, which extract the teacher's parameters in a random manner and by following an importance metric, respectively. In this way, the student has already acquired some knowledge at the beginning of distillation, which makes the distillation process converge faster. We demonstrate the effectiveness of ETD on the GLUE benchmark and SQuAD. The experimental results show that: (1) compared with the baseline without the ETD strategy, ETD can save 70% of the computation cost; moreover, it achieves better results than the baseline when using the same computing resources; (2) ETD is generic and proves effective for different distillation methods (e.g., TinyBERT and MiniLM) and for students of different sizes. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model.
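The abstract describes the extraction step only at a high level. Below is a minimal sketch, in PyTorch, of the two variants as described: ETD-Rand picks teacher neurons at random, while ETD-Impt ranks them by an importance score. The function name extract_ffn_neurons, the restriction to a single feed-forward layer, and the L1-norm importance proxy are illustrative assumptions for this sketch, not the paper's exact procedure or metric.

```python
# Sketch of teacher-parameter extraction for student initialization,
# assuming a single BERT feed-forward layer and an L1-norm importance proxy.
import torch

def extract_ffn_neurons(w_in, w_out, n_keep, mode="impt"):
    """Extract n_keep intermediate neurons from a teacher FFN layer.

    w_in:  (d_ff, d_model) weight of the first FFN projection (teacher)
    w_out: (d_model, d_ff) weight of the second FFN projection (teacher)
    Returns the corresponding student sub-matrices.
    """
    d_ff = w_in.shape[0]
    if mode == "rand":
        # ETD-Rand: pick intermediate neurons uniformly at random.
        idx = torch.randperm(d_ff)[:n_keep]
    else:
        # ETD-Impt: rank neurons by an importance score; here the score is an
        # assumed proxy (L1 norm of each neuron's incoming and outgoing weights).
        score = w_in.abs().sum(dim=1) + w_out.abs().sum(dim=0)
        idx = torch.topk(score, n_keep).indices
    return w_in[idx, :], w_out[:, idx]

# Toy usage: shrink a BERT-base-sized FFN (d_model=768, d_ff=3072) to d_ff=1200.
teacher_in = torch.randn(3072, 768)
teacher_out = torch.randn(768, 3072)
student_in, student_out = extract_ffn_neurons(teacher_in, teacher_out, 1200)
print(student_in.shape, student_out.shape)  # (1200, 768) and (768, 1200)
```

In the full method, an analogous extraction would presumably be applied to every layer (attention heads, hidden dimensions, FFN neurons) to initialize the student, which is then trained with an existing task-agnostic distillation objective such as TinyBERT's or MiniLM's.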
Pages: 570 - 581
Number of pages: 12
Related Papers
50 records in total
  • [1] Improving task-agnostic BERT distillation with layer mapping search
    Jiao, Xiaoqi
    Chang, Huating
    Yin, Yichun
    Shang, Lifeng
    Jiang, Xin
    Chen, Xiao
    Li, Linlin
    Wang, Fang
    Liu, Qun
    NEUROCOMPUTING, 2021, 461 : 194 - 203
  • [2] LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
    Pan, Zhuoshi
    Wu, Qianhui
    Jiang, Huiqiang
    Xia, Menglin
    Luo, Xufang
    Zhang, Jue
    Lin, Qingwei
    Ruhle, Victor
    Yang, Yuqing
    Lin, Chin-Yew
    Zhao, H. Vicky
    Qiu, Lili
    Zhang, Dongmei
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 963 - 981
  • [3] MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
    Sun, Zhiqing
    Yu, Hongkun
    Song, Xiaodan
    Liu, Renjie
    Yang, Yiming
    Zhou, Denny
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 2158 - 2170
  • [4] To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging
    Bhattacharjee, Kasturi
    Ballesteros, Miguel
    Anubhai, Rishita
    Muresan, Smaranda
    Ma, Jie
    Ladhak, Faisal
    Al-Onaizan, Yaser
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 7927 - 7934
  • [5] Continual deep reinforcement learning with task-agnostic policy distillation
    Hafez, Muhammad Burhan
    Erekmen, Kerim
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [6] Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation
    Hafez, Muhammad Burhan
    Erekmen, Kerim
    arXiv,
  • [7] NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search
    Xu, Jin
    Tan, Xu
    Luo, Renqian
    Song, Kaitao
    Li, Jian
    Qin, Tao
    Liu, Tie-Yan
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 1933 - 1943
  • [8] TADA: Efficient Task-Agnostic Domain Adaptation for Transformers
    Hung, Chia-Chien
    Lange, Lukas
    Stroetgen, Jannik
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 487 - 503
  • [9] Towards a Task-agnostic Distillation Methodology for Creating Edge Foundation Models
    Dey, Swarnava
    Mukherjee, Arijit
    Ukil, Arijit
    Pal, Arpan
    PROCEEDINGS OF THE 2024 WORKSHOP ON EDGE AND MOBILE FOUNDATION MODELS, EDGEFM 2024, 2024, : 10 - 15
  • [10] Task-Agnostic Graph Explanations
    Xie, Yaochen
    Katariya, Sumeet
    Tang, Xianfeng
    Huang, Edward
    Rao, Nikhil
    Subbian, Karthik
    Ji, Shuiwang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,