Bilinear Graph Networks for Visual Question Answering

Times Cited: 28
Authors
Guo, Dalu [1 ]
Xu, Chang [1 ]
Tao, Dacheng [1 ,2 ]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Fac Engn, Sydney, NSW 2008, Australia
[2] JD Explore Acad, Beijing 101100, Peoples R China
Funding
Australian Research Council;
Keywords
Visualization; Feature extraction; Task analysis; Knowledge discovery; Cognition; Data models; Semantics; Bilinear graph; deep learning; graph neural networks (GNNs); visual question answering (VQA);
DOI
10.1109/TNNLS.2021.3104937
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article revisits bilinear attention networks (BANs) for the visual question answering task from a graph perspective. Classical BANs build a bilinear attention map to extract the joint representation of words in the question and objects in the image, but they do not fully explore the relationships between words that complex reasoning requires. In contrast, we develop bilinear graph networks to model the context of the joint embeddings of words and objects. Two kinds of graphs are investigated, namely, the image-graph and the question-graph. The image-graph transfers features of the detected objects to their related query words, so that the output nodes carry both semantic and factual information. The question-graph exchanges information among these output nodes from the image-graph to amplify the implicit yet important relationships between objects. The two kinds of graphs cooperate with each other, so the resulting model can build the relationships and dependencies between objects, enabling multistep reasoning. Experimental results on the VQA v2.0 validation dataset demonstrate the ability of our method to handle complex questions. On the test-std set, our best single model achieves state-of-the-art performance, boosting the overall accuracy to 72.56%, and we are one of the top-two entries in the VQA Challenge 2020.
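The two-stage scheme sketched in the abstract (image-graph: words aggregate object features; question-graph: the fused word nodes exchange information with each other) can be illustrated with a toy attention-based message-passing sketch. This is a plain-Python illustration under simplifying assumptions, not the authors' implementation: it replaces the paper's bilinear attention with a simple dot-product similarity and omits learned projections and multi-head details.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def graph_layer(queries, keys, values):
    """One message-passing step: each query node aggregates value
    features from key nodes, weighted by attention, with a residual
    connection so the node keeps its own features."""
    out = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        msg = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        out.append([qi + mi for qi, mi in zip(q, msg)])
    return out

# Toy example: 2 word nodes, 3 detected-object nodes, 4-dim features.
words = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
objs  = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]

# Image-graph: words attend over objects, fusing semantic (word) and
# factual (object) information into the output nodes.
fused = graph_layer(words, objs, objs)

# Question-graph: the fused word nodes attend over each other,
# propagating implicit relationships between the objects they describe.
ctx = graph_layer(fused, fused, fused)
```

Stacking the two layers is what gives the multistep-reasoning behavior the abstract describes: information flows object → word, then word → word.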
Pages: 1023-1034
Page count: 12