MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering

Authors
Khademi, Mahmoud [1]
Yang, Ziyi [1]
Frujeri, Felipe Vieira [1]
Zhu, Chenguang [1]
Affiliations
[1] Microsoft Cognit Serv Res Grp, Redmond, WA 98052 USA
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Thanks to the strong reasoning capabilities of Large Language Models (LLMs), recent approaches to knowledge-based visual question answering (KVQA) utilize LLMs with a global caption of an input image to answer a question. However, these approaches may miss key visual information that is not captured by the caption. Moreover, they cannot fully utilize the visual information required to answer the question. To address these issues, we introduce a new framework called Multi-Modal Knowledge-Aware Reasoner (MM-Reasoner) for KVQA. MM-Reasoner first utilizes a set of vision APIs, such as dense captioners, object detectors, and OCR, to extract detailed information from the image in textual format. Then, it prompts an LLM to extract query-specific knowledge from the extracted textual information to provide a rich representation that contains external knowledge, commonsense, explicit supporting facts, and rationales required for reasoning. Finally, the knowledge, query, and visual input are used to fine-tune a Vision-Language Model (VLM). At test time, MM-Reasoner uses the potential answers predicted by the VLM to iteratively update and optimize the prompt, refining its answer. Empirical studies show that MM-Reasoner achieves state-of-the-art performance on several KVQA datasets.
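The test-time refinement loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `build_prompt`, `iterative_answer`, the prompt wording, the `max_rounds` limit, and the stop-when-unchanged rule are all assumptions made for illustration, and `llm` and `vlm` are hypothetical callables standing in for the real models (the `vlm` callable is assumed to already have access to the image).

```python
from typing import Callable, List


def build_prompt(question: str, visual_text: List[str], candidates: List[str]) -> str:
    """Assemble a knowledge-extraction prompt (hypothetical wording).

    visual_text holds textual output of vision tools (dense captions,
    detected objects, OCR); candidates holds answers predicted so far.
    """
    lines = ["Image details:"] + [f"- {t}" for t in visual_text]
    lines.append(f"Question: {question}")
    if candidates:
        lines.append("Candidate answers: " + ", ".join(candidates))
    lines.append("List the facts and rationale needed to answer.")
    return "\n".join(lines)


def iterative_answer(
    question: str,
    visual_text: List[str],
    llm: Callable[[str], str],        # hypothetical: prompt -> knowledge text
    vlm: Callable[[str, str], str],   # hypothetical: (question, knowledge) -> answer
    max_rounds: int = 3,              # assumed limit, not taken from the paper
) -> str:
    """Feed predicted answers back into the prompt until the answer is stable."""
    candidates: List[str] = []
    answer = ""
    for _ in range(max_rounds):
        knowledge = llm(build_prompt(question, visual_text, candidates))
        new_answer = vlm(question, knowledge)
        if new_answer == answer:
            break  # assumed stopping rule: the answer no longer changes
        answer = new_answer
        if answer not in candidates:
            candidates.append(answer)
    return answer
```

With real models, `llm` would wrap a call to a language model and `vlm` a call to the fine-tuned vision-language model; both are left abstract here because the abstract does not specify them.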
Pages: 6571-6581
Number of pages: 11