Bilinear Graph Networks for Visual Question Answering

Times Cited: 28
Authors
Guo, Dalu [1 ]
Xu, Chang [1 ]
Tao, Dacheng [1 ,2 ]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Fac Engn, Sydney, NSW 2008, Australia
[2] JD Explore Acad, Beijing 101100, Peoples R China
Funding
Australian Research Council;
Keywords
Visualization; Feature extraction; Task analysis; Knowledge discovery; Cognition; Data models; Semantics; Bilinear graph; deep learning; graph neural networks (GNNs); visual question answering (VQA);
DOI
10.1109/TNNLS.2021.3104937
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article revisits bilinear attention networks (BANs) for the visual question answering task from a graph perspective. Classical BANs build a bilinear attention map to extract the joint representation of words in the question and objects in the image, but they do not fully explore the relationships between words that complex reasoning requires. In contrast, we develop bilinear graph networks to model the context of the joint embeddings of words and objects. Two kinds of graphs are investigated, namely, the image-graph and the question-graph. The image-graph transfers features of the detected objects to their related query words, so that the output nodes carry both semantic and factual information. The question-graph exchanges information among these output nodes from the image-graph to amplify the implicit yet important relationships between objects. The two kinds of graphs cooperate with each other, so the resulting model can build the relationships and dependencies between objects, enabling multistep reasoning. Experimental results on the VQA v2.0 validation dataset demonstrate the ability of our method to handle complex questions. On the test-std set, our best single model achieves state-of-the-art performance, boosting the overall accuracy to 72.56%, and we are one of the top-two entries in the VQA Challenge 2020.
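The two-stage scheme sketched in the abstract (image-graph: words aggregate object features; question-graph: the fused word nodes exchange information with each other) can be illustrated with a toy attention-based message-passing sketch. This is a plain-Python illustration under simplifying assumptions, not the authors' implementation: it replaces the paper's bilinear attention with a simple dot-product similarity and omits learned projections and multi-head details.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def graph_layer(queries, keys, values):
    """One message-passing step: each query node aggregates value
    features from key nodes, weighted by attention, with a residual
    connection so the node keeps its own features."""
    out = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        msg = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        out.append([qi + mi for qi, mi in zip(q, msg)])
    return out

# Toy example: 2 word nodes, 3 detected-object nodes, 4-dim features.
words = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
objs  = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]

# Image-graph: words attend over objects, fusing semantic (word) and
# factual (object) information into the output nodes.
fused = graph_layer(words, objs, objs)

# Question-graph: the fused word nodes attend over each other,
# propagating implicit relationships between the objects they describe.
ctx = graph_layer(fused, fused, fused)
```

Stacking the two layers is what gives the multistep-reasoning behavior the abstract describes: information flows object → word, then word → word.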
Pages: 1023-1034
Page count: 12