Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering

Cited: 36
Authors
Zhang, Liyang [1 ,2 ]
Liu, Shuaicheng [3 ,4 ]
Liu, Donghao [4 ]
Zeng, Pengpeng [1 ,2 ]
Li, Xiangpeng [1 ,2 ]
Song, Jingkuan [1 ,2 ]
Gao, Lianli [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Future Media Ctr, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[3] Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China
[4] Megvii Technol Ltd, Chengdu 611730, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Feature extraction; Visualization; Knowledge based systems; Task analysis; Knowledge discovery; Semantics; Cognition; Knowledge base; object detection; self-attention; visual question answering (VQA);
DOI
10.1109/TNNLS.2020.3017530
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual question answering (VQA), which involves understanding an image and a paired question, has developed rapidly with the advance of deep learning in related research fields such as natural language processing and computer vision. Existing works rely heavily on the knowledge contained in the data set. However, some questions require more professional cues beyond the data set's knowledge to be answered correctly. To address this issue, we propose a novel framework named the knowledge-based augmentation network (KAN) for VQA. We introduce object-related open-domain knowledge to assist question answering. Concretely, we extract richer visual information from images and introduce a knowledge graph that provides the common sense or experience needed for reasoning. For these two augmented inputs, we design an attention module that adjusts itself to the specific question, so that the importance of external knowledge relative to detected objects can be balanced adaptively. Extensive experiments show that our KAN achieves state-of-the-art performance on three challenging VQA data sets, i.e., VQA v2, VQA-CP v2, and FVQA. In addition, our open-domain knowledge is also beneficial to VQA baselines. Code is available at https://github.com/yyyanglz/KAN.
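The adaptive balancing described in the abstract can be illustrated with a minimal sketch: attend over each input stream conditioned on the question, then blend the two summaries with a question-dependent gate. This is a hypothetical toy illustration of the idea, not the authors' implementation; all function names and the gating form are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(feats, query):
    """Question-conditioned attention: weight each feature vector by its
    similarity to the query, then return the weighted sum."""
    weights = softmax([dot(f, query) for f in feats])
    dim = len(feats[0])
    return [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]


def fuse(question, obj_feats, knowledge_feats, gate_params):
    """Blend object and knowledge summaries with a question-dependent
    sigmoid gate, so the model leans on whichever source the question
    calls for (sketch of the adaptive balancing in the abstract)."""
    obj_summary = attend(obj_feats, question)
    kn_summary = attend(knowledge_feats, question)
    g = 1.0 / (1.0 + math.exp(-dot(gate_params, question)))  # gate in (0, 1)
    return [g * k + (1.0 - g) * o for k, o in zip(kn_summary, obj_summary)]
```

In a real model the summaries would be learned multimodal embeddings and the gate a trained projection of the question encoding; the sketch only shows how a single scalar gate can trade off external knowledge against detected-object evidence per question.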
Pages: 4362 - 4373
Page count: 12
Related Papers
50 records in total
  • [1] Explicit Knowledge-based Reasoning for Visual Question Answering
    Wang, Peng
    Wu, Qi
    Shen, Chunhua
    Dick, Anthony
    van den Hengel, Anton
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 1290 - 1296
  • [2] Medical knowledge-based network for Patient-oriented Visual Question Answering
    Jian, Huang
    Chen, Yihao
    Yong, Li
    Yang, Zhenguo
    Gong, Xuehao
    Lee, Wang Fu
    Xu, Xiaohong
    Liu, Wenyin
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02)
  • [3] Knowledge enhancement and scene understanding for knowledge-based visual question answering
    Su, Zhenqiang
    Gou, Gang
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2024, 66 (03) : 2193 - 2208
  • [5] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    [J]. PATTERN RECOGNITION, 2020, 108
  • [6] MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
    Ding, Yang
    Yu, Jing
    Liu, Bang
    Hu, Yue
    Cui, Mingxin
    Wu, Qi
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5079 - 5088
  • [7] Answering knowledge-based visual questions via the exploration of Question Purpose
    Song, Lingyun
    Li, Jianao
    Liu, Jun
    Yang, Yang
    Shang, Xuequn
    Sun, Mingxuan
    [J]. PATTERN RECOGNITION, 2023, 133
  • [8] Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering
    Li, Qifeng
    Tang, Xinyi
    Jian, Yi
    [J]. SENSORS, 2022, 22 (04)
  • [9] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    [J]. ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438
  • [10] Multimodal Inverse Cloze Task for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    [J]. ADVANCES IN INFORMATION RETRIEVAL, ECIR 2023, PT I, 2023, 13980 : 569 - 587