Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering

Cited: 36
Authors
Zhang, Liyang [1 ,2 ]
Liu, Shuaicheng [3 ,4 ]
Liu, Donghao [4 ]
Zeng, Pengpeng [1 ,2 ]
Li, Xiangpeng [1 ,2 ]
Song, Jingkuan [1 ,2 ]
Gao, Lianli [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Future Media Ctr, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[3] Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China
[4] Megvii Technol Ltd, Chengdu 611730, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Feature extraction; Visualization; Knowledge based systems; Task analysis; Knowledge discovery; Semantics; Cognition; Knowledge base; object detection; self-attention; visual question answering (VQA);
DOI
10.1109/TNNLS.2020.3017530
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual question answering (VQA), which involves understanding an image and a paired question, has developed rapidly with the advance of deep learning in related research fields such as natural language processing and computer vision. Existing works rely heavily on the knowledge contained in the data set. However, some questions require more professional cues beyond the data set's knowledge to be answered correctly. To address this issue, we propose a novel framework named the knowledge-based augmentation network (KAN) for VQA. We introduce object-related open-domain knowledge to assist question answering. Concretely, we extract richer visual information from images and introduce a knowledge graph that provides the common sense or experience needed for reasoning. For these two augmented inputs, we design an attention module that adjusts itself to the specific question, so that the importance of external knowledge relative to detected objects can be balanced adaptively. Extensive experiments show that our KAN achieves state-of-the-art performance on three challenging VQA data sets, i.e., VQA v2, VQA-CP v2, and FVQA. In addition, our open-domain knowledge is also beneficial to VQA baselines. Code is available at https://github.com/yyyanglz/KAN.
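The adaptive balancing described in the abstract can be illustrated with a minimal sketch: attend over each input stream conditioned on the question, then blend the two summaries with a question-dependent gate. This is a hypothetical toy illustration of the idea, not the authors' implementation; all function names and the gating form are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(feats, query):
    """Question-conditioned attention: weight each feature vector by its
    similarity to the query, then return the weighted sum."""
    weights = softmax([dot(f, query) for f in feats])
    dim = len(feats[0])
    return [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]


def fuse(question, obj_feats, knowledge_feats, gate_params):
    """Blend object and knowledge summaries with a question-dependent
    sigmoid gate, so the model leans on whichever source the question
    calls for (sketch of the adaptive balancing in the abstract)."""
    obj_summary = attend(obj_feats, question)
    kn_summary = attend(knowledge_feats, question)
    g = 1.0 / (1.0 + math.exp(-dot(gate_params, question)))  # gate in (0, 1)
    return [g * k + (1.0 - g) * o for k, o in zip(kn_summary, obj_summary)]
```

In a real model the summaries would be learned multimodal embeddings and the gate a trained projection of the question encoding; the sketch only shows how a single scalar gate can trade off external knowledge against detected-object evidence per question.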
Pages: 4362 - 4373
Page count: 12
Related Papers
50 records in total
  • [1] Explicit Knowledge-based Reasoning for Visual Question Answering
    Wang, Peng
    Wu, Qi
    Shen, Chunhua
    Dick, Anthony
    van den Hengel, Anton
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 1290 - 1296
  • [2] Medical knowledge-based network for Patient-oriented Visual Question Answering
    Jian, Huang
    Chen, Yihao
    Yong, Li
    Yang, Zhenguo
    Gong, Xuehao
    Lee, Wang Fu
    Xu, Xiaohong
    Liu, Wenyin
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02)
  • [3] Knowledge enhancement and scene understanding for knowledge-based visual question answering
    Su, Zhenqiang
    Gou, Gang
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2024, 66 (03) : 2193 - 2208
  • [5] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    [J]. PATTERN RECOGNITION, 2020, 108
  • [6] MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
    Ding, Yang
    Yu, Jing
    Liu, Bang
    Hu, Yue
    Cui, Mingxin
    Wu, Qi
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5079 - 5088
  • [7] Answering knowledge-based visual questions via the exploration of Question Purpose
    Song, Lingyun
    Li, Jianao
    Liu, Jun
    Yang, Yang
    Shang, Xuequn
    Sun, Mingxuan
    [J]. PATTERN RECOGNITION, 2023, 133
  • [8] Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering
    Li, Qifeng
    Tang, Xinyi
    Jian, Yi
    [J]. SENSORS, 2022, 22 (04)
  • [9] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    [J]. ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438
  • [10] Multimodal Inverse Cloze Task for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    [J]. ADVANCES IN INFORMATION RETRIEVAL, ECIR 2023, PT I, 2023, 13980 : 569 - 587