f-Divergence Minimization for Sequence-Level Knowledge Distillation

Cited by: 0
Authors
Wen, Yuqiao [1 ,2 ]
Li, Zichao [3 ]
Du, Wenyu [4 ]
Mou, Lili [1 ,2 ,5 ]
Affiliations
[1] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[2] Univ Alberta, Alberta Machine Intelligence Inst Amii, Edmonton, AB, Canada
[3] McGill Univ, Mila, Montreal, PQ, Canada
[4] Univ Hong Kong, Hong Kong, Peoples R China
[5] Amii, Edmonton, AB, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
DOI
Not available
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive step-wise decomposition for our f-DISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
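As a reading aid, the sketch below restates the standard f-divergence definition that underlies the framework described in the abstract. The notation (teacher distribution p, student distribution q_θ, source x, output sequence y) and the listed generator functions are illustrative assumptions rather than notation taken from the paper; the paper's own step-wise decomposition and its four distilling variants are specified in the full text.

D_f(p \,\|\, q_\theta) \;=\; \sum_{\mathbf{y}} q_\theta(\mathbf{y} \mid \mathbf{x}) \, f\!\left( \frac{p(\mathbf{y} \mid \mathbf{x})}{q_\theta(\mathbf{y} \mid \mathbf{x})} \right), \qquad f \text{ convex with } f(1) = 0.

Standard choices of the generator f recover familiar divergences: f(t) = t log t gives KL(p || q_θ), f(t) = -log t gives reverse KL, and f(t) = |t - 1| / 2 gives total variation; Jensen-Shannon divergence is likewise an f-divergence. Because the sum ranges over all possible output sequences y, the sequence-level objective cannot be computed directly, which motivates the step-wise (word-level) decomposition mentioned in the abstract.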
Pages: 10817-10834
Number of pages: 18
Related Papers
50 entries in total
  • [1] Investigation of Sequence-level Knowledge Distillation Methods for CTC Acoustic Models
    Takashima, Ryoichi
    Sheng, Li
    Kawai, Hisashi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6156-6160
  • [2] On Convergence in Wasserstein Distance and f-divergence Minimization Problems
    Li, Cheuk Ting
    Zhang, Jingwei
    Farnia, Farzan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [3] Mutual-learning sequence-level knowledge distillation for automatic speech recognition
    Li, Zerui
    Ming, Yue
    Yang, Lei
    Xue, Jing-Hao
    NEUROCOMPUTING, 2021, 428: 259-267
  • [4] Sequence-Level Knowledge Distillation for Model Compression of Attention-Based Sequence-to-Sequence Speech Recognition
    Mun'im, Raden Mu'az
    Inoue, Nakamasa
    Shinoda, Koichi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6151-6155
  • [5] f-Divergence Inequalities
    Sason, Igal
    Verdu, Sergio
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2016, 62(11): 5973-6006
  • [6] On f-divergence for σ-⊕-measures
    Agahi, Hamzeh
    Yadollahzadeh, Milad
    SOFT COMPUTING, 2021, 25(15): 9781-9787
  • [7] Geometry of f-Divergence
    Vos, P. W.
    ANNALS OF THE INSTITUTE OF STATISTICAL MATHEMATICS, 1991, 43(3): 515-537
  • [8] Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding
    Cappellazzo, Umberto
    Yang, Muqiao
    Falavigna, Daniele
    Brutti, Alessio
    INTERSPEECH 2023, 2023: 2953-2957
  • [9] Dynamics of the f-Divergence Minimization Processes Based on the Speed-Gradient Principle
    Shalymov, Dmitry S.
    Fradkov, Alexander L.
    2016 IEEE CONFERENCE ON NORBERT WIENER IN THE 21ST CENTURY (21CW), 2016: 7-11