Reinforced Multi-Teacher Selection for Knowledge Distillation

Cited by: 0
Authors
Yuan, Fei [1 ,2 ]
Shou, Linjun [2 ]
Pei, Jian [3 ]
Lin, Wutao [2 ]
Gong, Ming [2 ]
Fu, Yan [1 ]
Jiang, Daxin [2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Peoples R China
[2] Microsoft STCA NLP Grp, Beijing, Peoples R China
[3] Simon Fraser Univ, Sch Comp Sci, Burnaby, BC, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
DOI
Not available
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In natural language processing (NLP) tasks, slow inference speed and a large GPU memory footprint remain the bottlenecks of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation transfers knowledge from one or multiple large (teacher) models to a small (student) model. When multiple teacher models are available in distillation, the state-of-the-art methods assign each teacher model a fixed weight for the whole distillation process. Furthermore, most of the existing methods allocate an equal weight to every teacher model. In this paper, we observe that, due to the varying complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student model. We systematically develop a reinforced method to dynamically assign weights to teacher models for different training instances and optimize the performance of the student model. Our extensive experimental results on several NLP tasks clearly verify the feasibility and effectiveness of our approach.
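To make the mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of instance-dependent teacher selection for knowledge distillation. It is written under stated assumptions and is not the authors' implementation: a small policy network scores the teachers for each training instance, one teacher is sampled per example, the student distills from the sampled teacher, and the policy is updated with REINFORCE using the student's per-instance loss as a negated reward. The names TeacherPolicy and distill_step, the reward definition, and the loss weighting are all illustrative.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): per-instance
# teacher selection for knowledge distillation with a REINFORCE-trained policy.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TeacherPolicy(nn.Module):
    """Maps an input representation to a probability distribution over teachers."""

    def __init__(self, feat_dim: int, num_teachers: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, num_teachers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.scorer(feats), dim=-1)  # (batch, num_teachers)


def distill_step(student, teachers, policy, x, y, temperature=2.0, alpha=0.5):
    """One training step: sample a teacher per instance, distill from it,
    and return (student_loss, policy_loss)."""
    with torch.no_grad():
        # (batch, num_teachers, num_classes); teachers are frozen.
        teacher_logits = torch.stack([t(x) for t in teachers], dim=1)

    probs = policy(x)  # assumption: raw input features feed the policy
    dist = torch.distributions.Categorical(probs=probs)
    picked = dist.sample()  # one sampled teacher index per instance
    chosen = teacher_logits[torch.arange(x.size(0)), picked]

    student_logits = student(x)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(chosen / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    ce = F.cross_entropy(student_logits, y)
    student_loss = alpha * kd + (1.0 - alpha) * ce

    # REINFORCE: reward teacher choices that yield low per-instance student loss.
    with torch.no_grad():
        reward = -F.cross_entropy(student_logits, y, reduction="none")
        reward = (reward - reward.mean()) / (reward.std() + 1e-8)
    policy_loss = -(dist.log_prob(picked) * reward).mean()
    return student_loss, policy_loss


if __name__ == "__main__":
    # Toy demonstration with linear models on random data.
    feat_dim, num_classes, num_teachers = 16, 4, 3
    teachers = [nn.Linear(feat_dim, num_classes) for _ in range(num_teachers)]
    student = nn.Linear(feat_dim, num_classes)
    policy = TeacherPolicy(feat_dim, num_teachers)
    x = torch.randn(8, feat_dim)
    y = torch.randint(0, num_classes, (8,))
    s_loss, p_loss = distill_step(student, teachers, policy, x, y)
    (s_loss + p_loss).backward()
```

A fixed-weight or equal-weight ensemble would correspond to replacing the sampled teacher with a static average of the teacher logits; the sketch differs only in letting the selection distribution depend on each training instance.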
Pages: 14284-14291
Number of pages: 8
Related papers
50 records in total
  • [1] Correlation Guided Multi-teacher Knowledge Distillation
    Shi, Luyao
    Jiang, Ning
    Tang, Jialiang
    Huang, Xinlei
    [J]. NEURAL INFORMATION PROCESSING, ICONIP 2023, PT IV, 2024, 14450 : 562 - 574
  • [2] Knowledge Distillation via Multi-Teacher Feature Ensemble
    Ye, Xin
    Jiang, Rongxin
    Tian, Xiang
    Zhang, Rui
    Chen, Yaowu
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 566 - 570
  • [3] CONFIDENCE-AWARE MULTI-TEACHER KNOWLEDGE DISTILLATION
    Zhang, Hailin
    Chen, Defang
    Wang, Can
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4498 - 4502
  • [4] Adaptive multi-teacher multi-level knowledge distillation
    Liu, Yuang
    Zhang, Wei
    Wang, Jun
    [J]. NEUROCOMPUTING, 2020, 415 : 106 - 113
  • [5] Decoupled Multi-teacher Knowledge Distillation based on Entropy
    Cheng, Xin
    Tang, Jialiang
    Zhang, Zhiqiang
    Yu, Wenxin
    Jiang, Ning
    Zhou, Jinjia
    [J]. 2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,
  • [6] Anomaly detection based on multi-teacher knowledge distillation
    Ma, Ye
    Jiang, Xu
    Guan, Nan
    Yi, Wang
    [J]. JOURNAL OF SYSTEMS ARCHITECTURE, 2023, 138
  • [7] Robust Semantic Segmentation With Multi-Teacher Knowledge Distillation
    Amirkhani, Abdollah
    Khosravian, Amir
    Masih-Tehrani, Masoud
    Kashiani, Hossein
    [J]. IEEE ACCESS, 2021, 9 : 119049 - 119066
  • [8] Adaptive Multi-Teacher Knowledge Distillation with Meta-Learning
    Zhang, Hailin
    Chen, Defang
    Wang, Can
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1943 - 1948