Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Cited by: 12
Authors
Cao, Jianjian [1]
Qin, Xiameng [2]
Zhao, Sanyuan [1]
Shen, Jianbing [3]
Affiliations
[1] Beijing Inst Technol, Dept Comp Sci, Beijing 100081, Peoples R China
[2] Baidu Inc, Beijing 100193, Peoples R China
[3] Univ Macau, Dept Comp & Informat Sci, State Key Lab Internet Things Smart City, Macau, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Cognition; Task analysis; Semantics; Syntactics; Deep learning; Prediction algorithms; Graph matching attention (GMA); relational reasoning; visual question answering (VQA); NETWORKS;
DOI
10.1109/TNNLS.2021.3135655
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Answering semantically complicated questions about an image is challenging in the visual question answering (VQA) task. Although the image can be well represented by deep learning, the question is often simply embedded and its meaning is not well captured. Besides, there is a gap between the visual and textual features of different modalities, which makes it difficult to align and exploit cross-modality information. In this article, we focus on these two problems and propose a graph matching attention (GMA) network. First, it builds a graph not only for the image but also for the question, using both syntactic and embedding information. Next, we explore the intramodality relationships with a dual-stage graph encoder and then present a bilateral cross-modality GMA to infer the relationships between the image and the question. The updated cross-modality features are then sent to the answer prediction module for final answer prediction. Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. Ablation studies verify the effectiveness of each module in our GMA network.
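To make the bilateral cross-modality attention idea in the abstract concrete, below is a minimal, illustrative PyTorch sketch of attending between image-graph and question-graph node features in both directions and updating each modality with information from the other. The module name, the learned projections, and the scaled dot-product affinity are assumptions made for illustration only; the paper's actual GMA formulation, graph construction, and dual-stage encoder are not reproduced here.

```python
# Illustrative sketch only: a simple bilateral cross-modality attention between
# image-graph node features and question-graph node features. All names and the
# dot-product formulation are assumptions, not the authors' exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BilateralCrossModalityAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)  # projects visual (image-graph) node features
        self.proj_q = nn.Linear(dim, dim)  # projects textual (question-graph) node features

    def forward(self, v_nodes: torch.Tensor, q_nodes: torch.Tensor):
        # v_nodes: (N_v, dim) image-graph node features
        # q_nodes: (N_q, dim) question-graph node features
        v = self.proj_v(v_nodes)
        q = self.proj_q(q_nodes)

        # Cross-modality affinity (matching) scores between every visual node
        # and every question node, scaled by sqrt(dim).
        affinity = v @ q.t() / v.size(-1) ** 0.5          # (N_v, N_q)

        # Bilateral attention: visual nodes attend over question nodes,
        # and question nodes attend over visual nodes.
        v_to_q = F.softmax(affinity, dim=1)               # (N_v, N_q)
        q_to_v = F.softmax(affinity.t(), dim=1)           # (N_q, N_v)

        # Update each modality with features gathered from the other modality.
        v_updated = v_nodes + v_to_q @ q_nodes            # (N_v, dim)
        q_updated = q_nodes + q_to_v @ v_nodes            # (N_q, dim)
        return v_updated, q_updated


if __name__ == "__main__":
    gma = BilateralCrossModalityAttention(dim=512)
    v = torch.randn(36, 512)   # e.g., 36 detected object regions
    q = torch.randn(14, 512)   # e.g., 14 question-graph word nodes
    v_new, q_new = gma(v, q)
    print(v_new.shape, q_new.shape)  # torch.Size([36, 512]) torch.Size([14, 512])
```

In a full model, such updated cross-modality features would then feed an answer prediction head; here the sketch only shows the attention exchange between the two graphs.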
Pages: 12
Related Papers
50 records in total
  • [1] Cross-modality co-attention networks for visual question answering
    Han, Dezhi; Zhou, Shuli; Li, Kuan Ching; de Mello, Rodrigo Fernandes
    [J]. SOFT COMPUTING, 2021, 25 (07): 5411-5421
  • [2] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin; Sun, Jianyong; Chen, Xiaolin
    [J]. ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019: 412-416
  • [3] Bridging the Cross-Modality Semantic Gap in Visual Question Answering
    Wang, Boyue; Ma, Yujian; Li, Xiaoyan; Gao, Junbin; Hu, Yongli; Yin, Baocai
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024: 1-13
  • [4] Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering
    Fu, Ze; Zheng, Changmeng; Cai, Yi; Li, Qing; Wang, Tao
    [J]. WEB AND BIG DATA, APWEB-WAIM 2021, PT I, 2021, 12858: 316-331
  • [5] Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering
    Wang, Yan; Li, Peize; Si, Qingyi; Zhang, Hanwen; Zang, Wenyu; Lin, Zheng; Fu, Peng
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (03)
  • [6] Multimodal feature fusion by relational reasoning and attention for visual question answering
    Zhang, Weifeng; Yu, Jing; Hu, Hua; Hu, Haiyang; Qin, Zengchang
    [J]. INFORMATION FUSION, 2020, 55: 116-126
  • [7] Multi-Modality Global Fusion Attention Network for Visual Question Answering
    Yang, Cheng; Wu, Weijia; Wang, Yuxing; Zhou, Hong
    [J]. ELECTRONICS, 2020, 9 (11): 1-12
  • [8] Feature Enhancement in Attention for Visual Question Answering
    Lin, Yuetan; Pang, Zhangyang; Wang, Donghui; Zhuang, Yueting
    [J]. PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018: 4216-4222
  • [9] Cross-modality matching
    Auerbach, C.
    [J]. QUARTERLY JOURNAL OF EXPERIMENTAL PSYCHOLOGY, 1973, 25 (NOV): 492-495