Vision-language pre-training via modal interaction

Times cited: 0
Authors
Cheng, Hang [1 ]
Ye, Hehui [2 ]
Zhou, Xiaofei [3 ]
Liu, Ximeng [2 ]
Chen, Fei [2 ]
Wang, Meiqing [1 ]
Affiliations
[1] Fuzhou Univ, Sch Math & Stat, Fuzhou 350108, Peoples R China
[2] Fuzhou Univ, Coll Comp & Data Sci, Fuzhou 350108, Peoples R China
[3] Hangzhou Dianzi Univ, Sch Automation, Hangzhou 310018, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal; Pre-training; Partial auxiliary; Image captioning;
DOI
10.1016/j.patcog.2024.110809
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing vision-language pre-training models typically extract region features and conduct fine-grained local alignment based on masked image/text completion or object detection methods. However, these models often design independent subtasks for different modalities, which may not adequately leverage interactions between modalities, so they require large datasets to achieve optimal performance. To address these limitations, this paper introduces a novel pre-training approach that facilitates fine-grained vision-language interaction. We propose two new subtasks - image filling and text filling - that utilize data from one modality to complete missing parts in another, enhancing the model's ability to integrate multi-modal information. A selector mechanism is also developed to minimize semantic overlap between modalities, thereby improving the efficiency and effectiveness of the pre-trained model. Our comprehensive experimental results demonstrate that our approach not only fosters better semantic associations among different modalities but also achieves state-of-the-art performance on downstream vision-language tasks with significantly smaller datasets.
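The image-filling/text-filling idea in the abstract can be illustrated with a minimal NumPy sketch: mask features in one modality and reconstruct them from the other via a cross-modal attention map. This is a toy illustration of the general technique only, not the paper's actual architecture; all names, shapes, and the linear projection `W` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features (shapes and values are illustrative assumptions).
n_patches, n_tokens, d = 6, 8, 16
image_patches = rng.normal(size=(n_patches, d))   # region features
text_tokens = rng.normal(size=(n_tokens, d))      # token features

def cross_modal_fill(query_masked, context, W):
    """Reconstruct masked features in one modality from the other.

    query_masked: (n, d) features with masked rows zeroed out
    context:      (m, d) features from the other modality
    W:            (d, d) cross-modal projection (learned in practice)
    """
    scores = query_masked @ W @ context.T                     # (n, m) cross-modal scores
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = scores / scores.sum(axis=1, keepdims=True)         # softmax over context
    return attn @ context                                     # fill from the other modality

# Image-filling subtask: mask patch 0 and reconstruct it from the text.
mask = np.zeros(n_patches, dtype=bool)
mask[0] = True
masked_patches = image_patches.copy()
masked_patches[mask] = 0.0

W = rng.normal(scale=0.1, size=(d, d))
reconstruction = cross_modal_fill(masked_patches, text_tokens, W)

# L2 reconstruction loss on the masked positions only.
loss = float(((reconstruction[mask] - image_patches[mask]) ** 2).mean())
print(round(loss, 4))
```

The text-filling subtask would mirror this with the roles of the two modalities swapped; in training, the loss on masked positions would drive the learned projection so each modality learns to complete the other.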
Pages: 8