50 records in total
- [2] Learning Student-Friendly Teacher Networks for Knowledge Distillation. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021, 34.
- [5] Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT. PLOS ONE, 2022, 17(2).
- [6] SFT-KD-Recon: Learning a Student-friendly Teacher for Knowledge Distillation in Magnetic Resonance Image Reconstruction. Medical Imaging with Deep Learning (MIDL), 2023, 227: 1423-1440.
- [10] MiniALBERT: Model Distillation via Parameter-Efficient Recursive Transformers. 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023), 2023: 1161-1173.