Parameter-Efficient and Student-Friendly Knowledge Distillation

Cited by: 15
Authors
Rao, Jun [1 ,2 ]
Meng, Xv [2 ]
Ding, Liang [3 ]
Qi, Shuhan [2 ,4 ]
Liu, Xuebo [2 ]
Zhang, Min [2 ]
Tao, Dacheng [3 ]
Affiliations
[1] JD Explore Acad, Beijing, Peoples R China
[2] Harbin Inst Technol, Shenzhen 518055, Peoples R China
[3] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
[4] Guangdong Prov Key Lab Novel Secur Intelligence Te, Shenzhen 518000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Training; Smoothing methods; Knowledge transfer; Data models; Adaptation models; Predictive models; Knowledge engineering; Knowledge distillation; parameter-efficient; image classification
DOI
10.1109/TMM.2023.3321480
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Pre-trained models are frequently employed in multimodal learning. However, these models contain a very large number of parameters and require substantial effort to fine-tune on downstream tasks. Knowledge distillation (KD) transfers knowledge from such a pre-trained teacher model to a smaller student via soft labels, with the teacher's parameters fixed (or partially fixed) during training. Recent studies show that this setting can hinder knowledge transfer because of the mismatch in model capacities. To alleviate the mismatch, methods that smooth the teacher's soft labels have been proposed, including temperature adjustment, label smoothing, and teacher-student joint training (online distillation). However, these methods rarely explain why smoothed soft labels enhance KD performance. The main contributions of our work are the discovery, analysis, and validation of the effect of smoothed soft labels, and a less time-consuming, adaptive method for transferring the pre-trained teacher's knowledge, namely PESF-KD, which adaptively tunes the soft labels of the teacher network. Technically, we first mathematically formulate the mismatch as the sharpness gap between the teacher's and the student's predictive distributions, and we show that this gap can be narrowed with an appropriately smooth soft label. We then introduce an adapter module for the teacher and update only the adapter to obtain soft labels with suitable smoothness. Experiments on various CV and NLP benchmarks show that PESF-KD significantly reduces training cost while obtaining results competitive with advanced online distillation methods.
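The abstract describes training only a small adapter on the teacher to produce suitably smoothed soft labels while distilling into the student. The PyTorch sketch below illustrates that idea under stated assumptions: the LogitAdapter class, its bottleneck size, the temperature T, the loss weighting alpha, and the choice to drive the adapter with the same KD objective are illustrative assumptions, not the authors' released implementation.

```python
# Minimal, illustrative sketch of the PESF-KD idea (assumptions flagged above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogitAdapter(nn.Module):
    """Small bottleneck module on top of the frozen teacher's logits.

    Only these few parameters are trained on the teacher side, which is what
    makes the transfer parameter-efficient.
    """

    def __init__(self, num_classes: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(num_classes, bottleneck)
        self.up = nn.Linear(bottleneck, num_classes)

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Residual form: the adapter only nudges (smooths) the teacher's logits.
        return logits + self.up(torch.relu(self.down(logits)))


def kd_step(teacher, adapter, student, x, y,
            opt_student, opt_adapter, T: float = 4.0, alpha: float = 0.5):
    """One joint update: the student and the teacher-side adapter are trained,
    while the teacher backbone stays frozen."""
    with torch.no_grad():
        t_logits = teacher(x)                               # frozen teacher forward pass

    t_soft = F.softmax(adapter(t_logits) / T, dim=-1)       # adaptively smoothed soft labels
    s_logits = student(x)
    s_log_soft = F.log_softmax(s_logits / T, dim=-1)

    # Standard temperature-scaled KD loss plus cross-entropy on the ground truth.
    kd_loss = F.kl_div(s_log_soft, t_soft, reduction="batchmean") * (T * T)
    ce_loss = F.cross_entropy(s_logits, y)
    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss

    opt_student.zero_grad()
    opt_adapter.zero_grad()
    loss.backward()          # gradients reach only the student and the adapter
    opt_student.step()
    opt_adapter.step()
    return loss.item()
```

In this sketch the adapter is optimized with the same distillation objective as the student; the paper's actual criterion for tuning the smoothness of the teacher's soft labels may differ, so treat this purely as a reading aid for the abstract.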
Pages: 4230-4241
Number of pages: 12