Causal Distillation for Language Models

Cited by: 0
Authors
Wu, Zhengxuan [1 ]
Geiger, Atticus [1 ]
Rozner, Joshua [1 ]
Kreiss, Elisa [1 ]
Lu, Hanson [1 ]
Icard, Thomas [1 ]
Potts, Christopher [1 ]
Goodman, Noah [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Keywords
EXPLANATION;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal dynamics of the teacher through a distillation interchange intervention training objective (DIITO). DIITO pushes the student model to become a causal abstraction of the teacher model - a faithful model with simpler causal structure. DIITO is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared against standard distillation under the same settings, DIITO results in lower perplexity on the WikiText-103M corpus (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
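To make the interchange-intervention idea concrete, below is a minimal, self-contained sketch of how a counterfactual-matching distillation term could be computed. Everything here is an illustrative assumption rather than the paper's exact recipe: the ToyLM class, the interchange_distillation_loss function, the MSE objective on counterfactual outputs, the chosen layer alignment, and the matching hidden sizes are all hypothetical stand-ins for the transformer models and objective used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """Tiny stand-in for a transformer stack: a list of hidden layers plus an
    output head. The real method operates on transformer hidden states."""
    def __init__(self, dim, hidden, depth):
        super().__init__()
        layers = []
        for i in range(depth):
            layers.append(nn.Sequential(
                nn.Linear(dim if i == 0 else hidden, hidden), nn.ReLU()))
        self.layers = nn.ModuleList(layers)
        self.head = nn.Linear(hidden, dim)

    def forward(self, x, swap_layer=None, swap_hidden=None):
        # Optional interchange intervention: after computing the activations
        # at `swap_layer`, overwrite them with activations taken from a
        # separate "source" forward pass.
        hiddens = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if swap_layer is not None and i == swap_layer:
                x = swap_hidden
            hiddens.append(x)
        return self.head(x), hiddens


def interchange_distillation_loss(teacher, student, base, source,
                                  teacher_layer, student_layer):
    """Swap source-run activations into the base run at aligned layers of the
    teacher and the student, then push the student's counterfactual output
    toward the teacher's (MSE here; the paper's objective may differ)."""
    with torch.no_grad():                                    # teacher is frozen
        _, t_src = teacher(source)                           # teacher source run
        t_cf, _ = teacher(base, swap_layer=teacher_layer,
                          swap_hidden=t_src[teacher_layer])  # teacher counterfactual
    _, s_src = student(source)                               # student source run (differentiable)
    s_cf, _ = student(base, swap_layer=student_layer,
                      swap_hidden=s_src[student_layer])      # student counterfactual
    return F.mse_loss(s_cf, t_cf)


# Illustrative usage: a deeper teacher, a shallower student, matching hidden
# sizes (with mismatched sizes a learned projection would be needed).
teacher = ToyLM(dim=16, hidden=64, depth=4)
student = ToyLM(dim=16, hidden=64, depth=2)
base, source = torch.randn(8, 16), torch.randn(8, 16)
loss = interchange_distillation_loss(teacher, student, base, source,
                                     teacher_layer=2, student_layer=1)
loss.backward()
```

In a full training setup this term would be added to the task-specific and hidden-state imitation losses described in the abstract; because the intervention is just an activation swap inside the forward pass, the whole objective remains differentiable.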
Pages: 4288 - 4295
Page count: 8
Related Papers
50 items in total
  • [21] Knowledge Base Grounded Pre-trained Language Models via Distillation
    Sourty, Raphael
    Moreno, Jose G.
    Servant, Francois-Paul
    Tamine, Lynda
    [J]. 39TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2024, 2024, : 1617 - 1625
  • [22] General Cross-Architecture Distillation of Pretrained Language Models into Matrix Embeddings
    Galke, Lukas
    Cuber, Isabelle
    Meyer, Christoph
    Noelscher, Henrik Ferdinand
    Sonderecker, Angelina
    Scherp, Ansgar
    [J]. 2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [23] Multimodality Self-distillation for Fast Inference of Vision and Language Pretrained Models
    Kong, Jun
    Wang, Jin
    Yu, Liang-Chih
    Zhang, Xuejie
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8928 - 8940
  • [24] Beyond Structural Causal Models: Causal Constraints Models
    Blom, Tineke
    Bongers, Stephan
    Mooij, Joris M.
    [J]. 35TH UNCERTAINTY IN ARTIFICIAL INTELLIGENCE CONFERENCE (UAI 2019), 2020, 115 : 585 - 594
  • [25] A causal partition of trait correlations: using graphical models to derive statistical models from theoretical language
    Cronin, James Patrick
    Schoolmaster, Donald R.
    [J]. ECOSPHERE, 2018, 9 (09):
  • [26] Coherent control of the causal order of entanglement distillation
    Zuo, Zai
    Hanks, Michael
    Kim, M. S.
    [J]. PHYSICAL REVIEW A, 2023, 108 (06)
  • [27] Causal State Distillation for Explainable Reinforcement Learning
    Lu, Wenhao
    Zhao, Xufeng
    Fryen, Thilo
    Lee, Jae Hee
    Li, Mengdi
    Magg, Sven
    Wermter, Stefan
    [J]. CAUSAL LEARNING AND REASONING, VOL 236, 2024, 236 : 106 - 142
  • [28] Causal Models
    Levine, Beverly
    [J]. EPIDEMIOLOGY, 2009, 20 (06) : 931 - 931
  • [29] Causal models
    Garbolino, Paolo
    [J]. APPLIED COGNITIVE PSYCHOLOGY, 2006, 20 (09) : 1243 - 1245
  • [30] ReAugKD: Retrieval-Augmented Knowledge Distillation For Pre-trained Language Models
    Zhang, Jianyi
    Muhamed, Aashiq
    Anantharaman, Aditya
    Wang, Guoyin
    Chen, Changyou
    Zhong, Kai
    Cui, Qingjun
    Xu, Yi
    Zeng, Belinda
    Chilimbi, Trishul
    Chen, Yiran
[J]. 61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 1128 - 1136