General Cross-Architecture Distillation of Pretrained Language Models into Matrix Embeddings

Cited by: 1
Authors
Galke, Lukas [1 ]
Cuber, Isabelle [2 ]
Meyer, Christoph [2 ]
Noelscher, Henrik Ferdinand [2 ]
Sonderecker, Angelina [2 ]
Scherp, Ansgar [2 ]
Affiliations
[1] Max Planck Inst Psycholinguist, Nijmegen, Netherlands
[2] Univ Ulm, Ulm, Germany
Keywords
DOI
10.1109/IJCNN55064.2022.9892144
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretraining, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language inference. Our matrix-based bidirectional CMOW/CBOW-Hybrid model is competitive with DistilBERT on question similarity and recognizing textual entailment, but uses only half the number of parameters and is three times faster in terms of inference speed. We match or exceed the scores of ELMo for all tasks of the GLUE benchmark except for the sentiment analysis task SST-2 and the linguistic acceptability task CoLA. However, compared to previous cross-architecture distillation approaches, we demonstrate a doubling of the scores on detecting linguistic acceptability. This shows that matrix-based embeddings can be used to distill large PreLMs into competitive models and motivates further research in this direction.
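The core encoding idea stated in the abstract (each word is a square matrix; a sequence is the ordered product of its word matrices, so word order matters) can be illustrated with a minimal sketch. This is not the authors' code: the vocabulary size, matrix dimension, near-identity initialization, and the way the bidirectional variant concatenates forward and backward products are all illustrative assumptions, not details confirmed by the record.

```python
import numpy as np

# Minimal sketch of the CMOW idea: every token id maps to a DIM x DIM matrix,
# and a sequence is encoded by multiplying those matrices left to right.
# Matrix multiplication is non-commutative, so the embedding is order-sensitive.
rng = np.random.default_rng(0)
VOCAB_SIZE, DIM = 1000, 8  # hypothetical sizes, chosen only for the example

# Initialize each token matrix near the identity so long products neither
# vanish nor explode (an assumption about initialization, not the paper's spec).
token_matrices = np.eye(DIM) + 0.01 * rng.standard_normal((VOCAB_SIZE, DIM, DIM))

def cmow_encode(token_ids):
    """Encode a token-id sequence as the flattened product of its word matrices."""
    state = np.eye(DIM)
    for t in token_ids:
        state = state @ token_matrices[t]
    return state.reshape(-1)  # DIM*DIM sequence embedding

def bidirectional_cmow_encode(token_ids):
    """Illustrative bidirectional variant: concatenate the forward product
    with the product over the reversed sequence."""
    fwd = cmow_encode(token_ids)
    bwd = cmow_encode(list(reversed(token_ids)))
    return np.concatenate([fwd, bwd])

# Usage: encode a toy sequence of token ids.
print(bidirectional_cmow_encode([5, 42, 7]).shape)  # (2 * DIM * DIM,) = (128,)
```

In the paper's hybrid variant, such a matrix-product representation is combined with an additive (CBOW-style) vector representation; the sketch above only shows the multiplicative part.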
Pages: 10
Related Papers
30 records in total
  • [1] Cross-Architecture Knowledge Distillation
    Liu, Yufan
    Cao, Jiajiong
    Li, Bing
    Hu, Weiming
    Ding, Jingting
    Li, Liang
    Maybank, Stephen
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (08) : 2798 - 2824
  • [2] Cross-Architecture Knowledge Distillation
    Liu, Yufan
    Cao, Jiajiong
    Li, Bing
    Hu, Weiming
    Ding, Jingting
    Li, Liang
    [J]. COMPUTER VISION - ACCV 2022, PT V, 2023, 13845 : 179 - 195
  • [3] Cross-Architecture Distillation for Face Recognition
    Zhao, Weisong
    Zhu, Xiangyu
    He, Zhixiang
    Zhang, Xiao-Yu
    Lei, Zhen
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 8076 - 8085
  • [4] Adaptive Cross-architecture Mutual Knowledge Distillation
    Ni, Jianyuan
    Tang, Hao
    Shang, Yuzhang
    Duan, Bin
    Yan, Yan
    [J]. 2024 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, FG 2024, 2024,
  • [5] FeatureMix: A General Adversarial Defense Method for Pretrained Language Models
    Dong, Huoyuan
    Wu, Longfei
    Guan, Zhitao
    [J]. IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 3415 - 3420
  • [6] ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models
    Li, Junyi
    Tang, Tianyi
    Gong, Zheng
    Yang, Lixin
    Yu, Zhuohao
    Chen, Zhipeng
    Wang, Jingyuan
    Zhao, Wayne Xin
    Wen, Ji-Rong
    [J]. NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 3519 - 3539
  • [7] Multimodality Self-distillation for Fast Inference of Vision and Language Pretrained Models
    Kong, Jun
    Wang, Jin
    Yu, Liang-Chih
    Zhang, Xuejie
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8928 - 8940
  • [8] LEVERAGING ACOUSTIC AND LINGUISTIC EMBEDDINGS FROM PRETRAINED SPEECH AND LANGUAGE MODELS FOR INTENT CLASSIFICATION
    Sharma, Bidisha
    Madhavi, Maulik
    Li, Haizhou
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7498 - 7502
  • [9] Expanding Language-Image Pretrained Models for General Video Recognition
    Ni, Bolin
    Peng, Houwen
    Chen, Minghao
    Zhang, Songyang
    Meng, Gaofeng
    Fu, Jianlong
    Xiang, Shiming
    Ling, Haibin
    [J]. COMPUTER VISION - ECCV 2022, PT IV, 2022, 13664 : 1 - 18
  • [10] Pretrained Transformer Language Models Versus Pretrained Word Embeddings for the Detection of Accurate Health Information on Arabic Social Media: Comparative Study
    Albalawi, Yahya
    Nikolov, Nikola S.
    Buckley, Jim
    [J]. JMIR FORMATIVE RESEARCH, 2022, 6 (06)