Multimodality Self-distillation for Fast Inference of Vision and Language Pretrained Models

Cited by: 0
Authors
Kong, Jun [1]
Wang, Jin [1]
Yu, Liang-Chih [2]
Zhang, Xuejie [1]
Affiliations
[1] Yunnan Univ, Sch Informat Sci & Engn, Kunming 650000, Peoples R China
[2] Yuan Ze Univ, Dept Informat Management, Taoyuan 32003, Taiwan
Funding
National Natural Science Foundation of China
Keywords
Computational modeling; Transformers; Task analysis; Visualization; Semantics; Training; Quantization (signal); Accelerating inference; early exiting; multimodality self-distillation; vision and language pretrained models;
DOI
10.1109/TMM.2024.3384060
CLC classification number
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
The computational cost of vision and language pretrained models (VL-PTMs) limits their deployment on resource-constrained devices that require low latency. One existing solution is to apply an early exiting (EE) strategy to accelerate inference, forcing the model to predict using only the first few transformer layers. However, these earlier layers behave differently from the final classifier, which inevitably degrades performance. To address this limitation, self-distillation is commonly introduced to enhance the representation ability of the EE classifiers. This, however, leaves a semantic gap, since the EE classifiers are trained to mimic the outputs of the final classifier directly, without access to modality-specific behaviors. This study proposes a multimodality self-distillation method for fast inference of VL-PTMs. To bridge the semantic gap between modalities, we split the multimodal input into its separate modalities and add them as extra inputs, encouraging effective distillation of each modality. Furthermore, the mean squared error (MSE) is introduced to minimize the distance between feature maps and further enhance the representation ability of the EE classifiers. Experiments show that the proposed method outperforms previous EE strategies at the same inference time and performs competitively even when the model exits very early.
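To make the abstract's objective concrete, below is a minimal PyTorch sketch of the kind of per-exit training loss it describes: each early-exit classifier is distilled from the final classifier with a softened KL term, while an MSE term pulls its intermediate feature map toward the final layer's feature map. The class name, hyperparameters, and tensor shapes are illustrative assumptions, not the paper's exact formulation; in the paper's setting this loss would additionally be accumulated over the separate vision-only and text-only inputs that are fed alongside the fused multimodal input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyExitDistillationLoss(nn.Module):
    """Hypothetical loss for one early-exit (EE) classifier: softened KL
    distillation from the final classifier plus an MSE term between the
    EE layer's feature map and the final layer's feature map."""

    def __init__(self, temperature: float = 2.0, alpha: float = 0.5):
        super().__init__()
        self.temperature = temperature  # softens the teacher distribution
        self.alpha = alpha              # balances the KL and MSE terms

    def forward(self, ee_logits, final_logits, ee_features, final_features):
        t = self.temperature
        # KL divergence between the EE classifier and the (detached) final classifier.
        kl = F.kl_div(
            F.log_softmax(ee_logits / t, dim=-1),
            F.softmax(final_logits.detach() / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)
        # MSE between the intermediate and final feature maps.
        mse = F.mse_loss(ee_features, final_features.detach())
        return self.alpha * kl + (1.0 - self.alpha) * mse


if __name__ == "__main__":
    # Toy usage: in practice this would be summed over every exit layer
    # and over the separate vision / text / fused forward passes.
    loss_fn = EarlyExitDistillationLoss()
    batch, classes, dim = 4, 3, 768
    ee_logits = torch.randn(batch, classes, requires_grad=True)
    final_logits = torch.randn(batch, classes)
    ee_feat = torch.randn(batch, dim, requires_grad=True)
    final_feat = torch.randn(batch, dim)
    loss = loss_fn(ee_logits, final_logits, ee_feat, final_feat)
    loss.backward()
    print(float(loss))
```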
Pages: 8928-8940
Page count: 13