f-Divergence Minimization for Sequence-Level Knowledge Distillation

Cited by: 0
Authors
Wen, Yuqiao [1 ,2 ]
Li, Zichao [3 ]
Du, Wenyu [4 ]
Mou, Lili [1 ,2 ,5 ]
Affiliations
[1] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[2] Univ Alberta, Alberta Machine Intelligence Inst Amii, Edmonton, AB, Canada
[3] McGill Univ, Mila, Montreal, PQ, Canada
[4] Univ Hong Kong, Hong Kong, Peoples R China
[5] Amii, Edmonton, AB, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
DOI
Not available
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive step-wise decomposition for our f-DISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
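As a reading aid, the sketch below restates the standard f-divergence definition that underlies the framework described in the abstract. The notation (teacher distribution p, student distribution q_θ, source x, output sequence y) and the listed generator functions are illustrative assumptions rather than notation taken from the paper; the paper's own step-wise decomposition and its four distilling variants are specified in the full text.

D_f(p \,\|\, q_\theta) \;=\; \sum_{\mathbf{y}} q_\theta(\mathbf{y} \mid \mathbf{x}) \, f\!\left( \frac{p(\mathbf{y} \mid \mathbf{x})}{q_\theta(\mathbf{y} \mid \mathbf{x})} \right), \qquad f \text{ convex with } f(1) = 0.

Standard choices of the generator f recover familiar divergences: f(t) = t log t gives KL(p || q_θ), f(t) = -log t gives reverse KL, and f(t) = |t - 1| / 2 gives total variation; Jensen-Shannon divergence is likewise an f-divergence. Because the sum ranges over all possible output sequences y, the sequence-level objective cannot be computed directly, which motivates the step-wise (word-level) decomposition mentioned in the abstract.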
Pages: 10817-10834
Number of pages: 18
Related Papers
50 entries in total
  • [1] Investigation of Sequence-level Knowledge Distillation Methods for CTC Acoustic Models
    Takashima, Ryoichi
    Sheng, Li
    Kawai, Hisashi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6156-6160
  • [2] On Convergence in Wasserstein Distance and f-divergence Minimization Problems
    Li, Cheuk Ting
    Zhang, Jingwei
    Farnia, Farzan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [3] Mutual-learning sequence-level knowledge distillation for automatic speech recognition
    Li, Zerui
    Ming, Yue
    Yang, Lei
    Xue, Jing-Hao
    NEUROCOMPUTING, 2021, 428: 259-267
  • [4] Sequence-Level Knowledge Distillation for Model Compression of Attention-Based Sequence-to-Sequence Speech Recognition
    Mun'im, Raden Mu'az
    Inoue, Nakamasa
    Shinoda, Koichi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6151-6155
  • [5] f-Divergence Inequalities
    Sason, Igal
    Verdu, Sergio
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2016, 62(11): 5973-6006
  • [6] On f-divergence for σ-⊕-measures
    Agahi, Hamzeh
    Yadollahzadeh, Milad
    SOFT COMPUTING, 2021, 25(15): 9781-9787
  • [7] Geometry of f-Divergence
    Vos, P. W.
    ANNALS OF THE INSTITUTE OF STATISTICAL MATHEMATICS, 1991, 43(3): 515-537
  • [8] Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding
    Cappellazzo, Umberto
    Yang, Muqiao
    Falavigna, Daniele
    Brutti, Alessio
    INTERSPEECH 2023, 2023: 2953-2957
  • [9] Dynamics of the f-Divergence Minimization Processes Based on the Speed-Gradient Principle
    Shalymov, Dmitry S.
    Fradkov, Alexander L.
    2016 IEEE CONFERENCE ON NORBERT WIENER IN THE 21ST CENTURY (21CW), 2016: 7-11