MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering

Citations: 0
Authors
Khademi, Mahmoud [1]
Yang, Ziyi [1]
Frujeri, Felipe Vieira [1]
Zhu, Chenguang [1]
Affiliation
[1] Microsoft Cognit Serv Res Grp, Redmond, WA 98052 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Thanks to the strong reasoning capabilities of Large Language Models (LLMs), recent approaches to knowledge-based visual question answering (KVQA) utilize LLMs with a global caption of an input image to answer a question. However, these approaches may miss key visual information that is not captured by the caption. Moreover, they cannot fully utilize the visual information required to answer the question. To address these issues, we introduce a new framework called Multi-Modal Knowledge-Aware Reasoner (MM-Reasoner) for KVQA. MM-Reasoner first utilizes a set of vision APIs, such as dense captioners, object detectors, and OCR, to extract detailed information from the image in textual format. Then, it prompts an LLM to extract query-specific knowledge from the extracted textual information to provide a rich representation that contains external knowledge, commonsense, explicit supporting facts, and rationales required for reasoning. Finally, the knowledge, query, and visual input are used to fine-tune a Vision-Language Model (VLM). At test time, MM-Reasoner uses the potential answers predicted by the VLM to iteratively update and optimize the prompt, refining its answer. Empirical studies show that MM-Reasoner achieves state-of-the-art performance on several KVQA datasets.
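The test-time loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `VisualText`, `build_prompt`, `answer`, `toy_llm`, and `toy_vlm` are hypothetical, the vision APIs, LLM, and VLM are replaced by toy stand-ins, the fine-tuning step is omitted, and the stopping rule (stop when the candidate list no longer changes) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VisualText:
    """Textual output of the vision APIs (all fields are hypothetical)."""
    dense_captions: List[str]
    objects: List[str]
    ocr: List[str]

    def as_text(self) -> str:
        return "\n".join([
            "Captions: " + "; ".join(self.dense_captions),
            "Objects: " + ", ".join(self.objects),
            "OCR: " + ", ".join(self.ocr),
        ])


def build_prompt(question: str, visual: VisualText, candidates: List[str]) -> str:
    """Assemble the LLM prompt; candidate answers are added once available."""
    lines = [visual.as_text(), f"Question: {question}"]
    if candidates:
        lines.append("Candidate answers: " + ", ".join(candidates))
    lines.append("List the facts and rationale needed to answer.")
    return "\n".join(lines)


def answer(
    question: str,
    visual: VisualText,
    llm: Callable[[str], str],
    vlm: Callable[[str, str], List[str]],
    max_rounds: int = 3,
) -> str:
    """Iteratively refine the prompt with the VLM's candidate answers."""
    candidates: List[str] = []
    for _ in range(max_rounds):
        knowledge = llm(build_prompt(question, visual, candidates))
        new_candidates = vlm(question, knowledge)
        if new_candidates == candidates:  # assumed stopping rule
            break
        candidates = new_candidates
    return candidates[0] if candidates else ""


# Toy stand-ins so the sketch runs without any external model or service.
def toy_llm(prompt: str) -> str:
    if "STOP" in prompt:
        return "The text 'STOP' on a red octagon indicates a stop sign."
    return ""


def toy_vlm(question: str, knowledge: str) -> List[str]:
    return ["stop sign"] if "stop sign" in knowledge else ["unknown"]


visual = VisualText(
    dense_captions=["a red octagonal sign on a pole"],
    objects=["sign", "pole"],
    ocr=["STOP"],
)
result = answer("What kind of sign is this?", visual, toy_llm, toy_vlm)
```

In a real system, `llm` and `vlm` would wrap actual model calls, and `VisualText` would be filled from dense-captioning, object-detection, and OCR services.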
Pages: 6571-6581
Page count: 11