Multi-scale relation reasoning for multi-modal Visual Question Answering

Cited by: 35
Authors
Wu, Yirui [1 ]
Ma, Yuntao [2 ]
Wan, Shaohua [3 ]
Affiliations
[1] Hohai Univ, Coll Comp & Informat, Fochengxi Rd, Nanjing 210093, Peoples R China
[2] Nanjing Univ, Natl Key Lab Novel Software Technol, Xianling Rd, Nanjing 210093, Peoples R China
[3] Zhongnan Univ Econ & Law, Sch Informat & Safety Engn, Wuhan, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Multi-modal data; Visual Question Answering; Multi-scale relation reasoning; Attention model;
DOI
10.1016/j.image.2021.116319
CLC Classification Number
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification Code
0808; 0809;
Abstract
The goal of Visual Question Answering (VQA) is to answer questions about images. For a single image, the questions asked can be of completely different types, so the main difficulty of the VQA task lies in properly reasoning about relationships among multiple visual objects according to the type of input question. To address this difficulty, this paper proposes a deep neural network that performs multi-modal relation reasoning at multiple scales, built on a regional attention scheme that focuses on informative, question-related regions for better answering. Specifically, we first design a regional attention scheme that selects regions of interest based on informativeness scores computed by a question-guided soft attention module. Afterwards, the features produced by the regional attention scheme are fused in combinations at several scales, generating more distinctive features that carry scale information. Owing to the regional attention design and the multi-scale property, the proposed method can describe relationships at different scales from multi-modal inputs and thus offer accurate, question-guided answers. Experiments on the VQA v1 and VQA v2 datasets show that the proposed method outperforms most existing methods.
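The record reproduces only the abstract, not the authors' code. The PyTorch module below is a minimal sketch of the pipeline the abstract outlines (a question-guided soft attention module scoring regions, a regional attention scheme keeping the most informative ones, and fusion of the selected features at several scales); the class name, feature dimensions, `top_k`, group sizes, and the sum-then-mean fusion are all illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleRegionalAttention(nn.Module):
    """Sketch (assumed, not the paper's code): question-guided soft attention
    over image regions, top-k region selection, and multi-scale fusion."""

    def __init__(self, v_dim=2048, q_dim=1024, hid=512, top_k=9, scales=(1, 2, 3)):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid)   # project region features
        self.q_proj = nn.Linear(q_dim, hid)   # project question embedding
        self.score = nn.Linear(hid, 1)        # informativeness score per region
        self.top_k = top_k
        self.scales = scales
        self.fuse = nn.Linear(hid * len(scales), hid)

    def forward(self, v, q):
        # v: (B, R, v_dim) region features; q: (B, q_dim) question embedding
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))  # (B, R, hid)
        attn = F.softmax(self.score(joint).squeeze(-1), dim=-1)          # (B, R)

        # regional attention: keep the top-k most question-relevant regions
        w, idx = attn.topk(self.top_k, dim=-1)                           # (B, k)
        sel = torch.gather(joint, 1, idx.unsqueeze(-1).expand(-1, -1, joint.size(-1)))
        sel = sel * w.unsqueeze(-1)                                      # weight by attention

        # multi-scale fusion: combine selected regions in groups of size s,
        # then pool, so each scale captures relations of a different granularity
        pooled = []
        for s in self.scales:
            n = (self.top_k // s) * s
            groups = sel[:, :n].reshape(sel.size(0), n // s, s, sel.size(-1))
            pooled.append(groups.sum(dim=2).mean(dim=1))                 # (B, hid)
        return self.fuse(torch.cat(pooled, dim=-1))                      # (B, hid)
```

Under these assumptions, the fused vector would then be combined with the question representation and passed to an answer classifier, as is typical in VQA pipelines.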
Pages: 9
Related Papers
50 records in total
  • [1] Multi-scale Relational Reasoning with Regional Attention for Visual Question Answering
    Ma, Yuntao
    Lu, Tong
    Wu, Yirui
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021: 5642 - 5649
  • [2] Multi-modal co-attention relation networks for visual question answering
    Guo, Zihan
    Han, Dezhi
    [J]. VISUAL COMPUTER, 2023, 39 (11): : 5783 - 5795
  • [3] Differentiated Attention with Multi-modal Reasoning for Video Question Answering
    Yao, Shentao
    Li, Kun
    Xing, Kun
    Wu, Kewei
    Xie, Zhao
    Guo, Dan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING, BIG DATA AND ALGORITHMS (EEBDA), 2022: 525 - 530
  • [4] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (09): 3894 - 3908
  • [5] Multi-modal adaptive gated mechanism for visual question answering
    Xu, Yangshuyi
    Zhang, Lin
    Shen, Xiang
    [J]. PLOS ONE, 2023, 18 (06)
  • [6] Multi-modal spatial relational attention networks for visual question answering
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Sun, Yuxin
    Zhang, Zhi
    Luo, Yongkang
    [J]. IMAGE AND VISION COMPUTING, 2023, 140
  • [7] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Begüm
    [J]. IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [8] The multi-modal fusion in visual question answering: a review of attention mechanisms
    Lu, Siyu
    Liu, Mingzhe
    Yin, Lirong
    Yin, Zhengtong
    Liu, Xuan
    Zheng, Wenfeng
    [J]. PEERJ COMPUTER SCIENCE, 2023, 9
  • [9] Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering
    Guo, Zihan
    Han, Dezhi
    [J]. SENSORS, 2020, 20 (23): 1 - 15