Improved Knowledge Distillation via Teacher Assistant

Cited: 0
Authors
Mirzadeh, Seyed Iman [1 ]
Farajtabar, Mehrdad [2 ]
Li, Ang [2 ]
Levine, Nir [2 ]
Matsukawa, Akihiro [3 ]
Ghasemzadeh, Hassan [1 ]
Affiliations
[1] Washington State Univ, Pullman, WA 99164 USA
[2] DeepMind, Mountain View, CA USA
[3] DE Shaw, New York, NY USA
Funding
National Science Foundation (USA);
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network's performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher; in other words, a teacher can effectively transfer its knowledge only to students down to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets with plain CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
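The distillation objective underlying both standard knowledge distillation and the teacher-assistant variant described in the abstract can be sketched in plain Python. This is a minimal illustrative implementation, not the paper's code: the function names, the temperature `T`, and the mixing weight `alpha` are our own illustrative choices, following the common Hinton-style formulation.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution,
    # exposing more of the teacher's "dark knowledge" about wrong classes.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.9):
    """Weighted sum of (a) KL divergence between temperature-softened
    teacher and student distributions and (b) cross-entropy with the
    ground-truth label. TAKD applies this same loss twice: once to
    distill teacher -> teacher assistant, then teacher assistant -> student."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student) on softened outputs; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    ce = -math.log(softmax(student_logits)[true_label])
    return alpha * (T * T) * kl + (1 - alpha) * ce
```

When the student's logits match the teacher's exactly, the KL term vanishes and only the hard-label cross-entropy remains; the teacher-assistant idea is that a moderately sized intermediate network keeps this KL term small enough at each step for the transfer to succeed.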
Pages: 5191-5198 (8 pages)