Multi-Teacher Distillation With Single Model for Neural Machine Translation

Cited by: 5
Authors:
Liang, Xiaobo [1 ]
Wu, Lijun [2 ]
Li, Juntao [1 ]
Qin, Tao [2 ]
Zhang, Min [1 ]
Liu, Tie-Yan [2 ]
Affiliations:
[1] Soochow Univ, Suzhou 215006, Peoples R China
[2] Microsoft Res, Beijing 100080, Peoples R China
Funding:
US National Science Foundation;
Keywords:
Dropout; knowledge distillation; multiple teachers; neural machine translation; sub-networks;
DOI
10.1109/TASLP.2022.3153264
CLC Number:
O42 [Acoustics];
Subject Classification Codes:
070206; 082403;
Abstract:
Knowledge distillation (KD) is an effective strategy in neural machine translation (NMT) for improving the performance of a student model. Usually, the teacher guides the student by distilling soft labels or data knowledge from the teacher itself. However, with only one teacher model, data diversity and teacher knowledge are limited. A natural solution is to adopt multiple randomly initialized teacher models, but a major shortcoming is that model parameters and training costs grow with the number of teachers. In this work, we explore mimicking multi-teacher distillation using the sub-network space and permuted variants of a single teacher model. Specifically, we train a teacher with multiple sub-network extraction paradigms: sub-layer reordering, layer-drop, and dropout variants. In doing so, one teacher model provides multiple output variants while introducing no additional parameters and little extra training cost. Experiments on 8 IWSLT datasets (IWSLT14 En <-> De, En <-> Es and IWSLT17 En <-> Fr, En <-> Zh) and the large WMT14 En -> De translation task show that our method achieves nearly comparable performance to multiple teacher models with different randomized parameters, under both word-level and sequence-level knowledge distillation. Our code is available online at https://github.com/dropreg/RLD.
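The abstract only sketches the sub-network idea, so the following is a minimal, hypothetical PyTorch illustration (the authors' actual implementation is at https://github.com/dropreg/RLD; every class, function, and hyperparameter below is an assumption, not taken from the paper). It shows how a single teacher's shared layer stack could be re-sampled per forward pass via layer-drop, sub-layer reordering, and live dropout to yield several "virtual teachers", whose soft labels are then averaged into a word-level distillation loss for the student.

# Hypothetical sketch of single-model multi-teacher distillation; not the authors' code.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, N_LAYERS = 1000, 64, 6

class ToyTeacher(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(DIM, nhead=4, dim_feedforward=128,
                                       dropout=0.3, batch_first=True)
            for _ in range(N_LAYERS)
        ])
        self.proj = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, layer_order=None):
        # layer_order selects and orders a sub-network of the shared layer stack.
        x = self.embed(tokens)
        for idx in (layer_order or range(len(self.layers))):
            x = self.layers[idx](x)
        return self.proj(x)

def sample_subnetwork(n_layers, drop_p=0.2, reorder=True):
    """Sample one sub-network variant: layer-drop plus optional reordering."""
    kept = [i for i in range(n_layers) if random.random() > drop_p] or [0]
    if reorder:
        random.shuffle(kept)
    return kept

def multi_teacher_kd_loss(teacher, student_logits, tokens, k_variants=3, temp=2.0):
    """Average a word-level KD loss over k sampled teacher variants."""
    teacher.train()  # keep dropout active so every forward pass differs
    loss = 0.0
    for _ in range(k_variants):
        order = sample_subnetwork(len(teacher.layers))
        with torch.no_grad():
            t_logits = teacher(tokens, layer_order=order)
        loss = loss + F.kl_div(
            F.log_softmax(student_logits / temp, dim=-1),
            F.softmax(t_logits / temp, dim=-1),
            reduction="batchmean",
        ) * temp ** 2
    return loss / k_variants

if __name__ == "__main__":
    teacher = ToyTeacher()
    tokens = torch.randint(0, VOCAB, (2, 8))
    student_logits = torch.randn(2, 8, VOCAB, requires_grad=True)  # stand-in student output
    print(multi_teacher_kd_loss(teacher, student_logits, tokens).item())

Sequence-level distillation would instead decode translations from each sampled variant and train the student on them; that is omitted here for brevity.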
Pages: 992-1002
Page count: 11