MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering

Cited by: 0

Authors:

Khademi, Mahmoud [1]

Yang, Ziyi [1]

Frujeri, Felipe Vieira [1]

Zhu, Chenguang [1]

Affiliations:

[1] Microsoft Cognitive Services Research Group, Redmond, WA 98052, USA

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023 | 2023

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Thanks to the strong reasoning capabilities of Large Language Models (LLMs), recent approaches to knowledge-based visual question answering (KVQA) utilize LLMs with a global caption of an input image to answer a question. However, these approaches may miss key visual information that is not captured by the caption. Moreover, they cannot fully utilize the visual information required to answer the question. To address these issues, we introduce a new framework called Multi-Modal Knowledge-Aware Reasoner (MM-Reasoner) for KVQA. MM-Reasoner first utilizes a set of vision APIs, such as dense captioners, object detectors, and OCR, to extract detailed information from the image in textual format. Then, it prompts an LLM to extract query-specific knowledge from the extracted textual information to provide a rich representation that contains external knowledge, commonsense, explicit supporting facts, and rationales required for reasoning. Finally, the knowledge, query, and visual input are used to fine-tune a Vision-Language Model (VLM). At test time, MM-Reasoner uses the potential answers predicted by the VLM to iteratively update and optimize the prompt, refining its answer. Empirical studies show that MM-Reasoner achieves state-of-the-art performance on several KVQA datasets.
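The inference loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Pipeline` class, the prompt wording, the stopping rule, and the toy stand-ins for the vision tools, the LLM, and the VLM are all hypothetical, and the VLM fine-tuning step is omitted. The sketch only shows how textual image details, LLM-extracted knowledge, and candidate answers from earlier rounds might be combined.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    """Hypothetical container for the three components named in the abstract."""
    vision_tools: List[Callable[[str], str]]  # image -> textual description
    llm: Callable[[str], str]                 # prompt -> query-specific knowledge
    vlm: Callable[[str, str, str], str]       # (image, question, knowledge) -> answer

    def describe(self, image: str) -> str:
        # Run every vision tool and join the outputs into one text block.
        return "\n".join(tool(image) for tool in self.vision_tools)

    def build_prompt(self, description: str, question: str,
                     candidates: List[str]) -> str:
        # Assumed prompt layout; the abstract does not specify the wording.
        lines = ["Image details:", description, f"Question: {question}"]
        if candidates:
            lines.append("Candidate answers so far: " + ", ".join(candidates))
        lines.append("List the facts and rationale needed to answer.")
        return "\n".join(lines)

    def answer(self, image: str, question: str, rounds: int = 2) -> str:
        description = self.describe(image)
        candidates: List[str] = []
        answer = ""
        for _ in range(rounds):
            prompt = self.build_prompt(description, question, candidates)
            knowledge = self.llm(prompt)
            answer = self.vlm(image, question, knowledge)
            if answer in candidates:
                break  # assumed stopping rule: the answer no longer changes
            candidates.append(answer)
        return answer


# Toy stand-ins so the sketch runs without any external model or API.
def toy_captioner(image: str) -> str:
    return f"caption: a red double-decker bus ({image})"


def toy_ocr(image: str) -> str:
    return "ocr: 'LONDON TRANSPORT'"


def toy_llm(prompt: str) -> str:
    return "knowledge: red double-decker buses are associated with London"


def toy_vlm(image: str, question: str, knowledge: str) -> str:
    return "London" if "London" in knowledge else "unknown"


if __name__ == "__main__":
    pipeline = Pipeline([toy_captioner, toy_ocr], toy_llm, toy_vlm)
    print(pipeline.answer("bus.jpg", "Which city is this bus from?"))
```

In a real system the toy functions would be replaced by calls to actual captioning, detection, and OCR services and to trained models; the loop structure is the only part this sketch is meant to convey.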


Pages: 6571-6581

Number of pages: 11

