Parameter-Efficient and Student-Friendly Knowledge Distillation

Cited by: 15
Authors
Rao, Jun [1 ,2 ]
Meng, Xv [2 ]
Ding, Liang [3 ]
Qi, Shuhan [2 ,4 ]
Liu, Xuebo [2 ]
Zhang, Min [2 ]
Tao, Dacheng [3 ]
Affiliations
[1] JD Explore Acad, Beijing, Peoples R China
[2] Harbin Inst Technol, Shenzhen 518055, Peoples R China
[3] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
[4] Guangdong Prov Key Lab Novel Secur Intelligence Technol, Shenzhen 518000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Smoothing methods; Knowledge transfer; Data models; Adaptation models; Predictive models; Knowledge engineering; Knowledge distillation; parameter-efficient; image classification;
DOI
10.1109/TMM.2023.3321480
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Pre-trained models are frequently employed in multimodal learning. However, these models contain a large number of parameters and are costly to fine-tune on downstream tasks. Knowledge distillation (KD) transfers knowledge from such a pre-trained teacher model to a smaller student via the teacher's soft labels, with the teacher's parameters fully (or partially) fixed during training. Recent studies show that this setup may hinder knowledge transfer due to the mismatched model capacities. To alleviate this mismatch, methods that smooth the teacher's soft labels have been proposed, including temperature adjustment, label smoothing, and joint teacher-student training (online distillation). However, these methods rarely explain why smoothed soft labels enhance KD performance. The main contributions of this work are the discovery, analysis, and validation of the effect of smoothed soft labels, and a less time-consuming, adaptive method for transferring the pre-trained teacher's knowledge, namely PESF-KD, which adaptively tunes the soft labels of the teacher network. Technically, we first mathematically formulate the mismatch as the sharpness gap between the teacher's and student's predictive distributions and show that this gap can be narrowed by appropriately smoothing the soft labels. Then, we introduce an adapter module for the teacher and update only the adapter to obtain soft labels with appropriate smoothness. Experiments on various CV and NLP benchmarks show that PESF-KD significantly reduces training cost while achieving competitive results compared with advanced online distillation methods.
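The abstract describes the approach only at a high level (frozen teacher, a small trainable adapter that smooths the teacher's soft labels, standard distillation of the student). The sketch below is a minimal, hypothetical PyTorch-style illustration of that idea, not the authors' implementation: the names TinyAdapter and kd_step, the residual bottleneck design, and the joint optimization of student and adapter under one combined loss are assumptions made for illustration.

```python
# Illustrative sketch only: a frozen teacher whose logits pass through a small
# trainable adapter before being softened into soft labels for the student.
import torch
import torch.nn.functional as F
from torch import nn


class TinyAdapter(nn.Module):
    """Small residual bottleneck applied to the frozen teacher's logits (hypothetical design)."""

    def __init__(self, num_classes: int, hidden: int = 32):
        super().__init__()
        self.down = nn.Linear(num_classes, hidden)
        self.up = nn.Linear(hidden, num_classes)

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the adapted logits close to the teacher's originals.
        return logits + self.up(torch.relu(self.down(logits)))


def kd_step(student, teacher, adapter, x, y, temperature=4.0, alpha=0.5):
    """One training step: cross-entropy on hard labels plus KL to the adapted soft labels."""
    with torch.no_grad():                       # the teacher backbone stays frozen
        teacher_logits = teacher(x)
    # The adapter is applied outside no_grad, so gradients reach its (few) parameters.
    soft_targets = F.softmax(adapter(teacher_logits) / temperature, dim=-1)
    student_logits = student(x)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    ce_loss = F.cross_entropy(student_logits, y)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

In use, the optimizer would cover both student.parameters() and adapter.parameters() while excluding the teacher's own parameters; whether the adapter is trained with exactly this combined objective is an assumption here, not a detail given in the abstract.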
Pages: 4230 - 4241
Number of pages: 12
Related Papers
50 records in total
  • [1] Student-friendly knowledge distillation
    Yuan, Mengyang
    Lang, Bo
    Quan, Fengnan
    KNOWLEDGE-BASED SYSTEMS, 2024, 296
  • [2] Learning Student-Friendly Teacher Networks for Knowledge Distillation
    Park, Dae Young
    Cha, Moon-Hyun
    Jeong, Changwook
    Kim, Dae Sin
    Han, Bohyung
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [3] PET: Parameter-efficient Knowledge Distillation on Transformer
    Jeon, Hyojin
    Park, Seungcheol
    Kim, Jin-Gee
    Kang, U.
    PLOS ONE, 2023, 18 (07):
  • [4] Parameter-efficient online knowledge distillation for pretrained language models
    Wang, Yukun
    Wang, Jin
    Zhang, Xuejie
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 265
  • [5] Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT
    Cho, Ikhyun
    Kang, U.
    PLOS ONE, 2022, 17 (02):
  • [6] SFT-KD-Recon: Learning a Student-friendly Teacher for Knowledge Distillation in Magnetic Resonance Image Reconstruction
    Gayathri, Matcha Naga
    Ramanarayanan, Sriprabha
    Al Fahiml, Mohammad
    Rahul, G. S.
    Ram, Keerthi
    Sivaprakasam, Mohanasankar
    MEDICAL IMAGING WITH DEEP LEARNING, VOL 227, 2023, 227 : 1423 - 1440
  • [7] Creating Student-Friendly Tests
    Salend, Spencer J.
    EDUCATIONAL LEADERSHIP, 2011, 69 (03) : 52 - 58
  • [8] Student-Friendly Teaching Approaches
    Ari, Asim
    Schmitt, Nicolas
    INTERNATIONAL JOURNAL OF INSTRUCTION, 2022, 15 (02) : I - III
  • [9] Student-Friendly Guide to Molecular Integrals
    Murphy, Kevin V.
    Turney, Justin M.
    Schaefer, Henry F., III
    JOURNAL OF CHEMICAL EDUCATION, 2018, 95 (09) : 1572 - 1578
  • [10] MiniALBERT: Model Distillation via Parameter-Efficient Recursive Transformers
    Nouriborji, Mohammadmahdi
    Rohanian, Omid
    Kouchaki, Samaneh
    Clifton, David A.
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 1161 - 1173