Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering

Cited by: 1
Authors
Salemi, Alireza [1 ]
Rafiee, Mahta [1 ]
Zamani, Hamed [1 ]
Affiliations
[1] Univ Massachusetts, Amherst, MA 01003 USA
Keywords
Dense Retrieval; Visual Question Answering; Multi-Modal Retrieval; Pre-training; Data Generation;
DOI
10.1145/3578337.3605137
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper studies a category of visual question answering tasks in which accessing external knowledge is necessary for answering the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is retrieving relevant documents for a given multi-modal query. The current state-of-the-art asymmetric dense retrieval model for this task uses an architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data for effective performance. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach yields a 26.9% Precision@5 improvement over the current state-of-the-art asymmetric architecture. Additionally, the proposed pre-training approach performs well in zero-shot retrieval scenarios.
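The asymmetric architecture described above can be illustrated with a minimal sketch. This is not the paper's implementation: real systems use learned transformer encoders, whereas here hypothetical one-hot bag-of-words vectors stand in for learned embeddings, and the additive fusion of question and image features is purely illustrative. The key structural point it demonstrates is that the document encoder sees only text, so the passage index can be built offline, independent of any query image.

```python
# Toy vocabulary with one-hot "embeddings" (placeholders for learned vectors).
WORDS = ["bird", "species", "eats", "seeds", "car", "engine"]
DIM = len(WORDS)
VOCAB = {w: [1.0 if i == j else 0.0 for j in range(DIM)]
         for i, w in enumerate(WORDS)}

def embed_text(text):
    """Mean of word vectors; stands in for a learned text encoder."""
    vecs = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def encode_query(question, image_feat):
    """Multi-modal query encoder: fuses question text with image
    features via naive additive fusion (illustrative only)."""
    q = embed_text(question)
    return [a + b for a, b in zip(q, image_feat)]

def encode_document(passage):
    """Uni-modal document encoder: text only, so the corpus can be
    indexed offline without access to query images."""
    return embed_text(passage)

def retrieve(question, image_feat, passages, k=2):
    """Rank passages by dot-product similarity to the multi-modal query."""
    q = encode_query(question, image_feat)
    return sorted(
        passages,
        key=lambda p: sum(a * b for a, b in zip(q, encode_document(p))),
        reverse=True,
    )[:k]

passages = ["this bird species eats seeds", "the car engine eats fuel"]
image_feat = VOCAB["bird"]  # pretend the image depicts a bird
top = retrieve("what species eats", image_feat, passages, k=1)
```

In this toy run the passage sharing words with both the question ("species", "eats") and the image feature ("bird") outscores the unrelated one, mirroring how the fused query embedding is matched against text-only passage embeddings.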
Pages: 169 - 176 (8 pages)
Related Papers
50 records
  • [1] RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
    Yuan, Zheng
    Jin, Qiao
    Tan, Chuanqi
    Zhao, Zhengyun
    Yuan, Hongyi
    Huang, Fei
    Huang, Songfang
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 547 - 556
  • [2] Passage Retrieval for Outside-Knowledge Visual Question Answering
    Qu, Chen
    Zamani, Hamed
    Yang, Liu
    Croft, W. Bruce
    Learned-Miller, Erik
    [J]. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1753 - 1757
  • [3] Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering
    Reichman, Benjamin
    Heck, Larry
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2829 - 2834
  • [4] Knowledge-Enhanced Visual Question Answering with Multi-modal Joint Guidance
    Wang, Jianfeng
    Zhang, Anda
    Du, Huifang
    Wang, Haofen
    Zhang, Wenqiang
    [J]. PROCEEDINGS OF THE 11TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE GRAPHS, IJCKG 2022, 2022, : 115 - 120
  • [5] Multi-Modal Contrastive Pre-training for Recommendation
    Liu, Zhuang
    Ma, Yunpu
    Schubert, Matthias
    Ouyang, Yuanxin
    Xiong, Zhang
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 99 - 108
  • [6] Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering
    Gong, Haifan
    Chen, Guanqi
    Liu, Sishuo
    Yu, Yizhou
    Li, Guanbin
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 456 - 460
  • [7] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    [J]. ELECTRONICS, 2023, 12 (06)
  • [8] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (09) : 3894 - 3908
  • [9] Multi-modal adaptive gated mechanism for visual question answering
    Xu, Yangshuyi
    Zhang, Lin
    Shen, Xiang
    [J]. PLOS ONE, 2023, 18 (06)
  • [10] MULTI-MODAL PRE-TRAINING FOR AUTOMATED SPEECH RECOGNITION
    Chan, David M.
    Ghosh, Shalini
    Chakrabarty, Debmalya
    Hoffmeister, Bjorn
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 246 - 250