Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

Cited by: 0
Authors
Liu, Zikang [1 ]
Chen, Sihan [1 ]
Guo, Longteng [1 ]
Li, Handong [1 ]
He, Xingjian [1 ]
Liu, Jing [1 ]
Affiliation
[1] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Vision-Language Pre-Training; Pre-Training Data Generation;
DOI
10.1145/3581783.3612388
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Large pre-trained multimodal models have demonstrated significant success in a range of downstream tasks, including image captioning, image-text retrieval, visual question answering (VQA), etc. However, many of these methods rely on image-text pairs collected from the web as pre-training data and unfortunately overlook the need for fine-grained feature alignment between vision and language modalities, which requires detailed understanding of images and language expressions. While integrating VQA and dense captioning (DC) into pre-training can address this issue, acquiring image-question-answer as well as image-location-caption triplets is challenging and time-consuming. Additionally, publicly available datasets for VQA and dense captioning are typically limited in scale due to manual data collection and labeling efforts. In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets. We apply this method to the Conceptual Captions (CC3M) dataset to generate a new dataset called CC3M-QA-DC. Experiments show that when used for pre-training in a multi-task manner, CC3M-QA-DC improves performance across various backbones and downstream tasks. Furthermore, our generated CC3M-QA-DC can be combined with larger image-text datasets (e.g., CC15M) and achieves competitive results compared with models using much more data. Code and dataset are available at https://github.com/johncaged/OPT_Questioner.
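The abstract describes a generate-then-filter pipeline: a pre-trained multimodal model turns web image-text pairs into candidate QA and dense-caption annotations, and low-confidence candidates are discarded. A minimal sketch of that loop is below; the callables `gen_qa`, `gen_dc`, and `score`, and the 0.5 threshold, are illustrative assumptions standing in for the model, not the authors' actual JADE implementation.

```python
def build_qa_dc_dataset(pairs, gen_qa, gen_dc, score, threshold=0.5):
    """Turn (image, caption) pairs into filtered QA and DC annotations.

    pairs     : iterable of (image, caption) tuples
    gen_qa    : (image, caption) -> list of (question, answer) candidates
    gen_dc    : image -> list of (region_box, region_caption) candidates
    score     : candidate tuple -> confidence in [0, 1]
    threshold : keep only candidates scoring at or above this value
    """
    qa_set, dc_set = [], []
    for image, caption in pairs:
        # Generate image-question-answer candidates, then filter by score.
        for question, answer in gen_qa(image, caption):
            if score((image, question, answer)) >= threshold:
                qa_set.append((image, question, answer))
        # Generate image-location-caption candidates, then filter by score.
        for box, region_caption in gen_dc(image):
            if score((image, box, region_caption)) >= threshold:
                dc_set.append((image, box, region_caption))
    return qa_set, dc_set
```

In the paper's setting the generators and the filter would both be backed by the pre-trained multimodal model; the sketch only shows how cheap image-text pairs are expanded into the two triplet types used for multi-task pre-training.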
Pages: 5120-5131
Page count: 12