Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation

Cited by: 1
Authors
Chen, Cheng [1 ]
Yin, Yichun [2 ]
Shang, Lifeng [2 ]
Wang, Zhi [3 ,4 ]
Jiang, Xin [2 ]
Chen, Xiao [2 ]
Liu, Qun [2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Huawei Noah's Ark Lab, Shenzhen, Peoples R China
[3] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[4] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
BERT; Knowledge distillation; Structured pruning;
DOI
10.1007/978-3-030-86365-4_46
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Task-agnostic knowledge distillation, a teacher-student framework, has proven effective for BERT compression. Although it achieves promising results on NLP tasks, it requires enormous computational resources. In this paper, we propose Extract Then Distill (ETD), a generic and flexible strategy that reuses the teacher's parameters for efficient and effective task-agnostic distillation and can be applied to students of any size. Specifically, we introduce two variants of ETD, ETD-Rand and ETD-Impt, which extract the teacher's parameters in a random manner and by following an importance metric, respectively. In this way, the student has already acquired some knowledge at the beginning of distillation, which makes the distillation process converge faster. We demonstrate the effectiveness of ETD on the GLUE benchmark and SQuAD. The experimental results show that: (1) compared with the baseline without the ETD strategy, ETD can save 70% of the computation cost; moreover, it achieves better results than the baseline when using the same computing resources; (2) ETD is generic and proves effective for different distillation methods (e.g., TinyBERT and MiniLM) and for students of different sizes. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model.
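The abstract describes the extraction step only at a high level. Below is a minimal sketch, in PyTorch, of the two variants as described: ETD-Rand picks teacher neurons at random, while ETD-Impt ranks them by an importance score. The function name extract_ffn_neurons, the restriction to a single feed-forward layer, and the L1-norm importance proxy are illustrative assumptions for this sketch, not the paper's exact procedure or metric.

```python
# Sketch of teacher-parameter extraction for student initialization,
# assuming a single BERT feed-forward layer and an L1-norm importance proxy.
import torch

def extract_ffn_neurons(w_in, w_out, n_keep, mode="impt"):
    """Extract n_keep intermediate neurons from a teacher FFN layer.

    w_in:  (d_ff, d_model) weight of the first FFN projection (teacher)
    w_out: (d_model, d_ff) weight of the second FFN projection (teacher)
    Returns the corresponding student sub-matrices.
    """
    d_ff = w_in.shape[0]
    if mode == "rand":
        # ETD-Rand: pick intermediate neurons uniformly at random.
        idx = torch.randperm(d_ff)[:n_keep]
    else:
        # ETD-Impt: rank neurons by an importance score; here the score is an
        # assumed proxy (L1 norm of each neuron's incoming and outgoing weights).
        score = w_in.abs().sum(dim=1) + w_out.abs().sum(dim=0)
        idx = torch.topk(score, n_keep).indices
    return w_in[idx, :], w_out[:, idx]

# Toy usage: shrink a BERT-base-sized FFN (d_model=768, d_ff=3072) to d_ff=1200.
teacher_in = torch.randn(3072, 768)
teacher_out = torch.randn(768, 3072)
student_in, student_out = extract_ffn_neurons(teacher_in, teacher_out, 1200)
print(student_in.shape, student_out.shape)  # (1200, 768) and (768, 1200)
```

In the full method, an analogous extraction would presumably be applied to every layer (attention heads, hidden dimensions, FFN neurons) to initialize the student, which is then trained with an existing task-agnostic distillation objective such as TinyBERT's or MiniLM's.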
Pages: 570 - 581
Number of pages: 12
Related Papers
50 records in total
  • [1] Improving task-agnostic BERT distillation with layer mapping search
    Jiao, Xiaoqi
    Chang, Huating
    Yin, Yichun
    Shang, Lifeng
    Jiang, Xin
    Chen, Xiao
    Li, Linlin
    Wang, Fang
    Liu, Qun
    NEUROCOMPUTING, 2021, 461 : 194 - 203
  • [2] LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
    Pan, Zhuoshi
    Wu, Qianhui
    Jiang, Huiqiang
    Xia, Menglin
    Luo, Xufang
    Zhang, Jue
    Lin, Qingwei
    Ruhle, Victor
    Yang, Yuqing
    Lin, Chin-Yew
    Zhao, H. Vicky
    Qiu, Lili
    Zhang, Dongmei
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 963 - 981
  • [3] MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
    Sun, Zhiqing
    Yu, Hongkun
    Song, Xiaodan
    Liu, Renjie
    Yang, Yiming
    Zhou, Denny
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 2158 - 2170
  • [4] To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging
    Bhattacharjee, Kasturi
    Ballesteros, Miguel
    Anubhai, Rishita
    Muresan, Smaranda
    Ma, Jie
    Ladhak, Faisal
    Al-Onaizan, Yaser
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 7927 - 7934
  • [5] Continual deep reinforcement learning with task-agnostic policy distillation
    Hafez, Muhammad Burhan
    Erekmen, Kerim
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [6] Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation
    Hafez, Muhammad Burhan
    Erekmen, Kerim
    arXiv,
  • [7] NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search
    Xu, Jin
    Tan, Xu
    Luo, Renqian
    Song, Kaitao
    Li, Jian
    Qin, Tao
    Liu, Tie-Yan
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 1933 - 1943
  • [8] TADA: Efficient Task-Agnostic Domain Adaptation for Transformers
    Hung, Chia-Chien
    Lange, Lukas
    Stroetgen, Jannik
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 487 - 503
  • [9] Towards a Task-agnostic Distillation Methodology for Creating Edge Foundation Models
    Dey, Swarnava
    Mukherjee, Arijit
    Ukil, Arijit
    Pal, Arpan
    PROCEEDINGS OF THE 2024 WORKSHOP ON EDGE AND MOBILE FOUNDATION MODELS, EDGEFM 2024, 2024, : 10 - 15
  • [10] Task-Agnostic Graph Explanations
    Xie, Yaochen
    Katariya, Sumeet
    Tang, Xianfeng
    Huang, Edward
    Rao, Nikhil
    Subbian, Karthik
    Ji, Shuiwang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,