MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering

Cited by: 0
Authors
Khademi, Mahmoud [1 ]
Yang, Ziyi [1 ]
Frujeri, Felipe Vieira [1 ]
Zhu, Chenguang [1 ]
Affiliations
[1] Microsoft Cognitive Services Research Group, Redmond, WA 98052 USA
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Thanks to the strong reasoning capabilities of Large Language Models (LLMs), recent approaches to knowledge-based visual question answering (KVQA) utilize LLMs with a global caption of an input image to answer a question. However, these approaches may miss key visual information that is not captured by the caption. Moreover, they cannot fully utilize the visual information required to answer the question. To address these issues, we introduce a new framework called Multi-Modal Knowledge-Aware Reasoner (MM-Reasoner) for KVQA. MM-Reasoner first utilizes a set of vision APIs, such as dense captioners, object detectors, and OCR, to extract detailed information from the image in textual format. Then, it prompts an LLM to extract query-specific knowledge from the extracted textual information to provide a rich representation that contains external knowledge, commonsense, explicit supporting facts, and rationales required for reasoning. Finally, the knowledge, query, and visual input are used to fine-tune a Vision-Language Model (VLM). At test time, MM-Reasoner uses the potential answers predicted by the VLM to iteratively update and optimize the prompt, refining its answer. Empirical studies show that MM-Reasoner achieves state-of-the-art performance on several KVQA datasets.
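The pipeline described in the abstract (vision APIs extract textual evidence, an LLM distills query-specific knowledge, a VLM answers, and the answer iteratively refines the prompt) can be sketched as follows. This is a minimal illustration of the control flow only; every function below (`dense_caption`, `detect_objects`, `run_ocr`, `prompt_llm`, `vlm_answer`) is a hypothetical stub standing in for the real models and APIs, not the authors' implementation.

```python
# Hedged sketch of the MM-Reasoner loop from the abstract.
# All functions here are toy stand-ins, not the paper's actual components.

def dense_caption(image):
    return ["a man holding a red umbrella"]  # stub dense captioner

def detect_objects(image):
    return ["man", "umbrella", "taxi"]  # stub object detector

def run_ocr(image):
    return ["TAXI"]  # stub OCR

def prompt_llm(question, visual_text, candidates=None):
    # Stand-in for prompting an LLM to extract query-specific knowledge
    # (external knowledge, supporting facts, rationales) from the text.
    parts = [f"Q: {question}", "Visual evidence: " + "; ".join(visual_text)]
    if candidates:
        # Candidate answers from the VLM are fed back to refine the prompt.
        parts.append("Candidate answers: " + ", ".join(candidates))
    return " | ".join(parts)

def vlm_answer(image, question, knowledge):
    # Stand-in for the fine-tuned Vision-Language Model.
    return "umbrella" if "umbrella" in knowledge else "unknown"

def mm_reasoner(image, question, n_iters=3):
    # Step 1: vision APIs turn the image into textual evidence.
    visual_text = dense_caption(image) + detect_objects(image) + run_ocr(image)
    candidates, answer = None, None
    for _ in range(n_iters):
        # Step 2: LLM builds a knowledge-rich prompt (refined by candidates).
        knowledge = prompt_llm(question, visual_text, candidates)
        # Step 3: VLM predicts an answer from query, knowledge, and image.
        new_answer = vlm_answer(image, question, knowledge)
        if new_answer == answer:  # answer stabilized; stop iterating
            break
        answer = new_answer
        candidates = [answer]  # Step 4: feed the answer back into the prompt
    return answer

print(mm_reasoner(image=None, question="What is the man holding?"))  # → umbrella
```

The feedback step mirrors the abstract's test-time refinement: the VLM's candidate answer is appended to the next prompt, and the loop stops once the answer no longer changes.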
Pages: 6571 - 6581
Page count: 11
Related papers
50 total
  • [1] Multi-Modal Knowledge-Aware Attention Network for Question Answering
    Zhang Y.
    Qian S.
    Fang Q.
    Xu C.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2020, 57 (05): : 1037 - 1045
  • [2] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    ELECTRONICS, 2023, 12 (06)
  • [3] Multi-modal Knowledge-aware Hierarchical Attention Network for Explainable Medical Question Answering
    Zhang, Yingying
    Qian, Shengsheng
    Fang, Quan
    Xu, Changsheng
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1089 - 1097
  • [4] Multi-Modal Validation and Domain Interaction Learning for Knowledge-Based Visual Question Answering
    Xu, Ning
    Gao, Yifei
    Liu, An-An
    Tian, Hongshuo
    Zhang, Yongdong
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (11) : 6628 - 6640
  • [5] KVQA: Knowledge-Aware Visual Question Answering
    Shah, Sanket
    Mishra, Anand
    Yadati, Naganand
    Talukdar, Partha Pratim
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8876 - 8884
  • [6] RK-VQA: Rational knowledge-aware fusion-in-decoder for knowledge-based visual question answering
    Chen, Weipeng
    Huang, Xu
    Liu, Zifeng
    Liu, Jin
    You, Lan
    INFORMATION FUSION, 2025, 118
  • [7] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    PATTERN RECOGNITION, 2020, 108
  • [8] MKGF: A multi-modal knowledge graph based RAG framework to enhance LVLMs for Medical visual question answering
    Wu, Yinan
    Lu, Yuming
    Zhou, Yan
    Ding, Yifan
    Liu, Jingping
    Ruan, Tong
    NEUROCOMPUTING, 2025, 635
  • [9] Knowledge-Aware Recommender Systems based on Multi-Modal Information Sources
    Spillo, Giuseppe
    PROCEEDINGS OF THE 17TH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2023, 2023, : 1312 - 1317
  • [10] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438