MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering

Cited by: 0
Authors
Khademi, Mahmoud [1 ]
Yang, Ziyi [1 ]
Frujeri, Felipe Vieira [1 ]
Zhu, Chenguang [1 ]
Affiliations
[1] Microsoft Cognitive Services Research Group, Redmond, WA 98052 USA
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Thanks to the strong reasoning capabilities of Large Language Models (LLMs), recent approaches to knowledge-based visual question answering (KVQA) utilize LLMs with a global caption of an input image to answer a question. However, these approaches may miss key visual information that is not captured by the caption. Moreover, they cannot fully utilize the visual information required to answer the question. To address these issues, we introduce a new framework called Multi-Modal Knowledge-Aware Reasoner (MM-Reasoner) for KVQA. MM-Reasoner first utilizes a set of vision APIs, such as dense captioners, object detectors, and OCR, to extract detailed information from the image in textual format. Then, it prompts an LLM to extract query-specific knowledge from the extracted textual information to provide a rich representation that contains external knowledge, commonsense, explicit supporting facts, and rationales required for reasoning. Finally, the knowledge, query, and visual input are used to fine-tune a Vision-Language Model (VLM). At test time, MM-Reasoner uses the potential answers predicted by the VLM to iteratively update and optimize the prompt, refining its answer. Empirical studies show that MM-Reasoner achieves state-of-the-art performance on several KVQA datasets.
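The pipeline described in the abstract (vision APIs extract textual evidence, an LLM distills query-specific knowledge, a VLM answers, and the answer iteratively refines the prompt) can be sketched as follows. This is a minimal illustration of the control flow only; every function below (`dense_caption`, `detect_objects`, `run_ocr`, `prompt_llm`, `vlm_answer`) is a hypothetical stub standing in for the real models and APIs, not the authors' implementation.

```python
# Hedged sketch of the MM-Reasoner loop from the abstract.
# All functions here are toy stand-ins, not the paper's actual components.

def dense_caption(image):
    return ["a man holding a red umbrella"]  # stub dense captioner

def detect_objects(image):
    return ["man", "umbrella", "taxi"]  # stub object detector

def run_ocr(image):
    return ["TAXI"]  # stub OCR

def prompt_llm(question, visual_text, candidates=None):
    # Stand-in for prompting an LLM to extract query-specific knowledge
    # (external knowledge, supporting facts, rationales) from the text.
    parts = [f"Q: {question}", "Visual evidence: " + "; ".join(visual_text)]
    if candidates:
        # Candidate answers from the VLM are fed back to refine the prompt.
        parts.append("Candidate answers: " + ", ".join(candidates))
    return " | ".join(parts)

def vlm_answer(image, question, knowledge):
    # Stand-in for the fine-tuned Vision-Language Model.
    return "umbrella" if "umbrella" in knowledge else "unknown"

def mm_reasoner(image, question, n_iters=3):
    # Step 1: vision APIs turn the image into textual evidence.
    visual_text = dense_caption(image) + detect_objects(image) + run_ocr(image)
    candidates, answer = None, None
    for _ in range(n_iters):
        # Step 2: LLM builds a knowledge-rich prompt (refined by candidates).
        knowledge = prompt_llm(question, visual_text, candidates)
        # Step 3: VLM predicts an answer from query, knowledge, and image.
        new_answer = vlm_answer(image, question, knowledge)
        if new_answer == answer:  # answer stabilized; stop iterating
            break
        answer = new_answer
        candidates = [answer]  # Step 4: feed the answer back into the prompt
    return answer

print(mm_reasoner(image=None, question="What is the man holding?"))  # → umbrella
```

The feedback step mirrors the abstract's test-time refinement: the VLM's candidate answer is appended to the next prompt, and the loop stops once the answer no longer changes.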
Pages: 6571 - 6581
Page count: 11
Related papers
50 total
  • [1] Multi-Modal Knowledge-Aware Attention Network for Question Answering
    Zhang Y.
    Qian S.
    Fang Q.
    Xu C.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2020, 57 (05): : 1037 - 1045
  • [2] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    ELECTRONICS, 2023, 12 (06)
  • [3] Multi-modal Knowledge-aware Hierarchical Attention Network for Explainable Medical Question Answering
    Zhang, Yingying
    Qian, Shengsheng
    Fang, Quan
    Xu, Changsheng
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1089 - 1097
  • [4] Multi-Modal Validation and Domain Interaction Learning for Knowledge-Based Visual Question Answering
    Xu, Ning
    Gao, Yifei
    Liu, An-An
    Tian, Hongshuo
    Zhang, Yongdong
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (11) : 6628 - 6640
  • [5] KVQA: Knowledge-Aware Visual Question Answering
    Shah, Sanket
    Mishra, Anand
    Yadati, Naganand
    Talukdar, Partha Pratim
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8876 - 8884
  • [6] RK-VQA: Rational knowledge-aware fusion-in-decoder for knowledge-based visual question answering
    Chen, Weipeng
    Huang, Xu
    Liu, Zifeng
    Liu, Jin
    You, Lan
    INFORMATION FUSION, 2025, 118
  • [7] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    PATTERN RECOGNITION, 2020, 108
  • [8] MKGF: A multi-modal knowledge graph based RAG framework to enhance LVLMs for Medical visual question answering
    Wu, Yinan
    Lu, Yuming
    Zhou, Yan
    Ding, Yifan
    Liu, Jingping
    Ruan, Tong
    NEUROCOMPUTING, 2025, 635
  • [9] Knowledge-Aware Recommender Systems based on Multi-Modal Information Sources
    Spillo, Giuseppe
    PROCEEDINGS OF THE 17TH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2023, 2023, : 1312 - 1317
  • [10] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438