Vision-language pre-training via modal interaction

Authors
Cheng, Hang [1]
Ye, Hehui [2]
Zhou, Xiaofei [3]
Liu, Ximeng [2]
Chen, Fei [2]
Wang, Meiqing [1]
Affiliations
[1] Fuzhou Univ, Sch Math & Stat, Fuzhou 350108, Peoples R China
[2] Fuzhou Univ, Coll Comp & Data Sci, Fuzhou 350108, Peoples R China
[3] Hangzhou Dianzi Univ, Sch Automation, Hangzhou 310018, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Cross-modal; Pre-training; Partial auxiliary; Image captioning
DOI
10.1016/j.patcog.2024.110809
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Existing vision-language pre-training models typically extract region features and conduct fine-grained local alignment based on masked image/text completion or object detection methods. However, these models often design independent subtasks for different modalities, which may not adequately leverage interactions between modalities, requiring large datasets to achieve optimal performance. To address these limitations, this paper introduces a novel pre-training approach that facilitates fine-grained vision-language interaction. We propose two new subtasks - image filling and text filling - that utilize data from one modality to complete missing parts in another, enhancing the model's ability to integrate multi-modal information. A selector mechanism is also developed to minimize semantic overlap between modalities, thereby improving the efficiency and effectiveness of the pre-trained model. Our comprehensive experimental results demonstrate that our approach not only fosters better semantic associations among different modalities but also achieves state-of-the-art performance on downstream vision-language tasks with significantly smaller datasets.
Pages: 8