Exploring All-In-One Knowledge Distillation Framework for Neural Machine Translation

Cited by: 0
Authors
Miao, Zhongjian [1 ,2 ]
Zhang, Wen [2 ]
Su, Jinsong [1 ]
Li, Xiang [2 ]
Luan, Jian [2 ]
Chen, Yidong [1 ]
Wang, Bin [2 ]
Zhang, Min [3 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[2] Xiaomi AI Lab, Beijing, Peoples R China
[3] Soochow Univ, Inst Comp Sci & Technol, Suzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
None available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Conventional knowledge distillation (KD) approaches are commonly employed to compress neural machine translation (NMT) models, but each run produces only a single lightweight student. Consequently, when several students of different sizes are required at the same time, KD must be conducted multiple times, which is resource-intensive. Moreover, these students are optimized individually and therefore never interact with one another, so their potential is not fully exploited. In this work, we propose All-In-One Knowledge Distillation (AIO-KD), a novel framework for NMT that generates multiple satisfactory students at once. Under AIO-KD, we first randomly extract fewer-layer subnetworks from the teacher as sampled students. We then jointly optimize the teacher and these students, where each student simultaneously learns knowledge from the teacher and interacts with the other students via mutual learning. At deployment time, we re-extract candidate students that satisfy the specifications of various devices. In particular, we adopt two carefully designed strategies for AIO-KD: 1) we dynamically detach gradients to prevent poorly performing students from negatively affecting the teacher during knowledge transfer, which would in turn harm the other students; 2) we design a two-stage mutual learning strategy that alleviates the negative impact of poorly performing students on early-stage student interactions. Extensive experiments and in-depth analyses on three benchmarks demonstrate the effectiveness and eco-friendliness of AIO-KD. Our source code is available at https://github.com/DeepLearnXMU/AIO-KD.
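To make the training scheme described in the abstract concrete, the following is a minimal PyTorch-style sketch of the core idea: randomly sampling fewer-layer subnetworks of the teacher as students, optimizing teacher and students jointly with KD (teacher outputs detached so that weak students cannot degrade it) plus mutual learning among students enabled only after a warm-up phase. The toy model, loss weighting, warm-up threshold, and the always-on detach (in place of the paper's dynamic criterion) are illustrative assumptions, not the authors' exact recipe from the released code.

```python
# Hedged sketch of AIO-KD-style joint training (assumptions noted above).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerStackModel(nn.Module):
    """Toy stand-in for the teacher: a stack of layers; a 'student' reuses
    only the first k layers of the same parameters."""
    def __init__(self, vocab=1000, dim=64, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.proj = nn.Linear(dim, vocab)

    def forward(self, x, num_layers=None):
        h = self.embed(x)
        for layer in self.layers[: num_layers or len(self.layers)]:
            h = torch.relu(layer(h))
        return self.proj(h)  # logits over the vocabulary

def kd_loss(student_logits, reference_logits, T=1.0):
    """KL(reference || student); the reference side is detached, so gradients
    never flow back into the teacher (or the peer student) through this term."""
    p_ref = F.softmax(reference_logits.detach() / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_ref, reduction="batchmean") * T * T

model = LayerStackModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):
    x = torch.randint(0, 1000, (8, 16))    # toy source batch
    y = torch.randint(0, 1000, (8 * 16,))  # toy target tokens (flattened)

    teacher_logits = model(x)  # full-depth teacher forward pass
    depths = random.sample(range(1, len(model.layers)), k=2)
    student_logits = [model(x, num_layers=d) for d in depths]  # sampled students

    # Translation loss for the teacher, plus CE + KD for every sampled student.
    loss = F.cross_entropy(teacher_logits.reshape(-1, 1000), y)
    for s in student_logits:
        loss = loss + F.cross_entropy(s.reshape(-1, 1000), y) + kd_loss(s, teacher_logits)

    # Two-stage mutual learning (assumption: peer-to-peer terms are switched on
    # only after a warm-up, so weak early-stage students do not mislead each other).
    if step > 50:
        for i, s_i in enumerate(student_logits):
            for j, s_j in enumerate(student_logits):
                if i != j:
                    loss = loss + kd_loss(s_i, s_j)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because every student is a prefix of the teacher's own layer stack, a single jointly trained checkpoint can later be sliced into candidate students of different depths to match the constraints of different deployment devices.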
Pages: 2929-2940
Page count: 12