General Cross-Architecture Distillation of Pretrained Language Models into Matrix Embeddings

Cited by: 1
Authors
Galke, Lukas [1 ]
Cuber, Isabelle [2 ]
Meyer, Christoph [2 ]
Noelscher, Henrik Ferdinand [2 ]
Sonderecker, Angelina [2 ]
Scherp, Ansgar [2 ]
Affiliations
[1] Max Planck Inst Psycholinguist, Nijmegen, Netherlands
[2] Univ Ulm, Ulm, Germany
Keywords
DOI
10.1109/IJCNN55064.2022.9892144
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretraining, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language inference. Our matrix-based bidirectional CMOW/CBOW-Hybrid model is competitive with DistilBERT on question similarity and recognizing textual entailment, but uses only half the number of parameters and is three times faster in terms of inference speed. We match or exceed the scores of ELMo for all tasks of the GLUE benchmark except for the sentiment analysis task SST-2 and the linguistic acceptability task CoLA. However, compared to previous cross-architecture distillation approaches, we demonstrate a doubling of the scores on detecting linguistic acceptability. This shows that matrix-based embeddings can be used to distill large PreLMs into competitive models and motivates further research in this direction.
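The core encoding idea stated in the abstract (each word is a square matrix; a sequence is the ordered product of its word matrices, so word order matters) can be illustrated with a minimal sketch. This is not the authors' code: the vocabulary size, matrix dimension, near-identity initialization, and the way the bidirectional variant concatenates forward and backward products are all illustrative assumptions, not details confirmed by the record.

```python
import numpy as np

# Minimal sketch of the CMOW idea: every token id maps to a DIM x DIM matrix,
# and a sequence is encoded by multiplying those matrices left to right.
# Matrix multiplication is non-commutative, so the embedding is order-sensitive.
rng = np.random.default_rng(0)
VOCAB_SIZE, DIM = 1000, 8  # hypothetical sizes, chosen only for the example

# Initialize each token matrix near the identity so long products neither
# vanish nor explode (an assumption about initialization, not the paper's spec).
token_matrices = np.eye(DIM) + 0.01 * rng.standard_normal((VOCAB_SIZE, DIM, DIM))

def cmow_encode(token_ids):
    """Encode a token-id sequence as the flattened product of its word matrices."""
    state = np.eye(DIM)
    for t in token_ids:
        state = state @ token_matrices[t]
    return state.reshape(-1)  # DIM*DIM sequence embedding

def bidirectional_cmow_encode(token_ids):
    """Illustrative bidirectional variant: concatenate the forward product
    with the product over the reversed sequence."""
    fwd = cmow_encode(token_ids)
    bwd = cmow_encode(list(reversed(token_ids)))
    return np.concatenate([fwd, bwd])

# Usage: encode a toy sequence of token ids.
print(bidirectional_cmow_encode([5, 42, 7]).shape)  # (2 * DIM * DIM,) = (128,)
```

In the paper's hybrid variant, such a matrix-product representation is combined with an additive (CBOW-style) vector representation; the sketch above only shows the multiplicative part.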
Pages: 10
Related Papers
30 records in total
  • [1] Cross-Architecture Knowledge Distillation
    Liu, Yufan
    Cao, Jiajiong
    Li, Bing
    Hu, Weiming
    Ding, Jingting
    Li, Liang
    Maybank, Stephen
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (08) : 2798 - 2824
  • [2] Cross-Architecture Knowledge Distillation
    Liu, Yufan
    Cao, Jiajiong
    Li, Bing
    Hu, Weiming
    Ding, Jingting
    Li, Liang
    [J]. COMPUTER VISION - ACCV 2022, PT V, 2023, 13845 : 179 - 195
  • [3] Cross-Architecture Distillation for Face Recognition
    Zhao, Weisong
    Zhu, Xiangyu
    He, Zhixiang
    Zhang, Xiao-Yu
    Lei, Zhen
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 8076 - 8085
  • [4] Adaptive Cross-architecture Mutual Knowledge Distillation
    Ni, Jianyuan
    Tang, Hao
    Shang, Yuzhang
    Duan, Bin
    Yan, Yan
    [J]. 2024 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, FG 2024, 2024,
  • [5] FeatureMix: A General Adversarial Defense Method for Pretrained Language Models
    Dong, Huoyuan
    Wu, Longfei
    Guan, Zhitao
    [J]. IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 3415 - 3420
  • [6] ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models
    Li, Junyi
    Tang, Tianyi
    Gong, Zheng
    Yang, Lixin
    Yu, Zhuohao
    Chen, Zhipeng
    Wang, Jingyuan
    Zhao, Wayne Xin
    Wen, Ji-Rong
    [J]. NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 3519 - 3539
  • [7] Multimodality Self-distillation for Fast Inference of Vision and Language Pretrained Models
    Kong, Jun
    Wang, Jin
    Yu, Liang-Chih
    Zhang, Xuejie
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8928 - 8940
  • [8] LEVERAGING ACOUSTIC AND LINGUISTIC EMBEDDINGS FROM PRETRAINED SPEECH AND LANGUAGE MODELS FOR INTENT CLASSIFICATION
    Sharma, Bidisha
    Madhavi, Maulik
    Li, Haizhou
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7498 - 7502
  • [9] Expanding Language-Image Pretrained Models for General Video Recognition
    Ni, Bolin
    Peng, Houwen
    Chen, Minghao
    Zhang, Songyang
    Meng, Gaofeng
    Fu, Jianlong
    Xiang, Shiming
    Ling, Haibin
    [J]. COMPUTER VISION - ECCV 2022, PT IV, 2022, 13664 : 1 - 18
  • [10] Pretrained Transformer Language Models Versus Pretrained Word Embeddings for the Detection of Accurate Health Information on Arabic Social Media: Comparative Study
    Albalawi, Yahya
    Nikolov, Nikola S.
    Buckley, Jim
    [J]. JMIR FORMATIVE RESEARCH, 2022, 6 (06)