Multi-scale relation reasoning for multi-modal Visual Question Answering

Cited by: 35
Authors
Wu, Yirui [1 ]
Ma, Yuntao [2 ]
Wan, Shaohua [3 ]
Affiliations
[1] Hohai Univ, Coll Comp & Informat, Fochengxi Rd, Nanjing 210093, Peoples R China
[2] Nanjing Univ, Natl Key Lab Novel Software Technol, Xianling Rd, Nanjing 210093, Peoples R China
[3] Zhongnan Univ Econ & Law, Sch Informat & Safety Engn, Wuhan, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Multi-modal data; Visual Question Answering; Multi-scale relation reasoning; Attention model;
DOI
10.1016/j.image.2021.116319
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic and Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
The goal of Visual Question Answering (VQA) is to answer questions about images. The same picture is often paired with completely different types of questions, so the main difficulty of the VQA task lies in how to properly reason about relationships among multiple visual objects according to the type of input question. To address this difficulty, this paper proposes a deep neural network that performs multi-modal relation reasoning at multiple scales, constructing a regional attention scheme that focuses on informative, question-related regions for better answering. Specifically, we first design a regional attention scheme that selects regions of interest based on informativeness scores computed by a question-guided soft attention module. The features produced by the regional attention scheme are then fused in scaled combinations, generating more distinctive features that carry multi-scale information. Owing to the regional attention design and the multi-scale property, the proposed method can describe scaled relationships from multi-modal inputs and offer accurate question-guided answers. Experiments on the VQA v1 and VQA v2 datasets show that the proposed method outperforms most existing methods.
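Since this record carries only the abstract, the following is a minimal, hypothetical PyTorch sketch of what question-guided regional attention with scaled feature combinations could look like. The class name, dimensions, scale set, and the top-k region-selection heuristic are all assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of question-guided regional attention with
# multi-scale fusion, loosely following the abstract's description.
# All names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedRegionalAttention(nn.Module):
    def __init__(self, region_dim=2048, question_dim=1024, hidden_dim=512,
                 scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # Project image regions and the question into a shared space.
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        # Fuse the concatenated per-scale features into one vector.
        self.fuse = nn.Linear(hidden_dim * len(scales), hidden_dim)

    def forward(self, regions, question):
        # regions: (B, N, region_dim); question: (B, question_dim)
        r = self.region_proj(regions)                       # (B, N, H)
        q = self.question_proj(question).unsqueeze(1)       # (B, 1, H)
        # Question-guided soft attention: score each region against q.
        logits = self.score(torch.tanh(r + q)).squeeze(-1)  # (B, N)
        weights = F.softmax(logits, dim=-1)                 # (B, N)
        fused_per_scale = []
        for k in self.scales:
            # Keep the k most informative regions at this "scale" and
            # pool their features, weighted by the attention scores.
            k = min(k, r.size(1))
            top_w, top_idx = weights.topk(k, dim=-1)        # (B, k)
            top_r = torch.gather(
                r, 1, top_idx.unsqueeze(-1).expand(-1, -1, r.size(-1)))
            pooled = (top_w.unsqueeze(-1) * top_r).sum(dim=1)  # (B, H)
            fused_per_scale.append(pooled)
        # Scaled combination: concatenate per-scale features and fuse.
        return self.fuse(torch.cat(fused_per_scale, dim=-1))  # (B, H)

# Usage: 36 region features per image, as in common bottom-up
# attention setups; the output would feed an answer classifier.
model = QuestionGuidedRegionalAttention()
regions = torch.randn(8, 36, 2048)
question = torch.randn(8, 1024)
out = model(regions, question)  # shape (8, 512)
```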
Pages: 9
Related Papers
50 records in total
  • [31] Multi-modal and multi-scale retinal imaging with angiography
    Shirazi, Muhammad Faizan
    Andilla, Jordi
    Cunquero, Marina
    Lefaudeux, Nicolas
    De Jesus, Danilo Andrade
    Brea, Luisa Sanchez
    Klein, Stefan
    van Walsum, Theo
    Grieve, Kate
    Paques, Michel
    Torm, Marie Elise Wistrup
    Larsen, Michael
    Loza-Alvarez, Pablo
    Levecq, Xavier
    Chateau, Nicolas
    Pircher, Michael
    [J]. INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, 2021, 62 (08)
  • [32] Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network
    Liu, Meng
    Zhang, Fenglei
    Luo, Xin
    Liu, Fan
    Wei, Yinwei
    Nie, Liqiang
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3985 - 3993
  • [33] Holistic Multi-Modal Memory Network for Movie Question Answering
    Wang, Anran
    Anh Tuan Luu
    Foo, Chuan-Sheng
    Zhu, Hongyuan
    Tay, Yi
    Chandrasekhar, Vijay
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 489 - 499
  • [34] A multi-scale contextual attention network for remote sensing visual question answering
    Feng, Jiangfan
    Wang, Hui
    [J]. INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2024, 126
  • [35] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Fan, Jianping
    Tao, Dacheng
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1839 - 1848
  • [36] Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering
    Huang, Hantao
    Han, Tao
    Han, Wei
    Yap, Deep
    Chiang, Cheng-Ming
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 1173 - 1180
  • [37] Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
    Hu, Xinyue
    Gu, Lin
    Kobayashi, Kazuma
    Liu, Liangchen
    Zhang, Mengliang
    Harada, Tatsuya
    Summers, Ronald M.
    Zhu, Yingying
    [J]. MEDICAL IMAGE ANALYSIS, 2024, 97
  • [38] NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario
    Qian, Tianwen
    Chen, Jingjing
    Zhuo, Linhai
    Jiao, Yang
    Jiang, Yu-Gang
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5, 2024, : 4542 - 4550
  • [39] Medical Visual Question-Answering Model Based on Knowledge Enhancement and Multi-Modal Fusion
    Zhang, Dianyuan
    Yu, Chuanming
    An, Lu
    [J]. PROCEEDINGS OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2024, 61 (01) : 703 - 708
  • [40] A Multi-scale and Multi-modal Transportation GIS for the City of Guangzhou
    Chen, Shaopei
    Claramunt, Christophe
    Ray, Cyril
    Tan, Jianjun
    [J]. INFORMATION FUSION AND GEOGRAPHIC INFORMATION SYSTEMS, PROCEEDINGS, 2009, : 95 - 111