Multi-scale relation reasoning for multi-modal Visual Question Answering

Cited by: 35
Authors
Wu, Yirui [1 ]
Ma, Yuntao [2 ]
Wan, Shaohua [3 ]
Affiliations
[1] Hohai Univ, Coll Comp & Informat, Fochengxi Rd, Nanjing 210093, Peoples R China
[2] Nanjing Univ, Natl Key Lab Novel Software Technol, Xianling Rd, Nanjing 210093, Peoples R China
[3] Zhongnan Univ Econ & Law, Sch Informat & Safety Engn, Wuhan, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Multi-modal data; Visual Question Answering; Multi-scale relation reasoning; Attention model;
DOI
10.1016/j.image.2021.116319
CLC Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
The goal of Visual Question Answering (VQA) is to answer natural-language questions about images. For the same image, the questions posed can be of completely different types, so the main difficulty of the VQA task lies in properly reasoning about relationships among multiple visual objects according to the type of the input question. To address this difficulty, this paper proposes a deep neural network that performs multi-modal relation reasoning at multiple scales, built around a regional attention scheme that focuses on informative, question-related regions for better answering. Specifically, we first design a regional attention scheme that selects regions of interest based on an informativeness score computed by a question-guided soft attention module. Afterwards, the features produced by the regional attention scheme are fused in scaled combinations, generating more distinctive features that carry information at several scales. Owing to the regional attention and multi-scale designs, the proposed method can describe scaled relationships among multi-modal inputs and thus offer accurate, question-guided answers. Experiments on the VQA v1 and VQA v2 datasets show that the proposed method outperforms most existing methods.
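The abstract names two components: a question-guided soft attention that scores image regions, and a fusion of the attended region features in scaled combinations. The paper's exact architecture is not reproduced in this record, so the PyTorch sketch below is only a hypothetical illustration of those two ideas; all dimensions, the top-k scale set, and the module names (QuestionGuidedRegionAttention, MultiScaleFusion) are assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedRegionAttention(nn.Module):
    # Hypothetical sketch: soft attention over image regions, guided by
    # the question embedding. Dimensions are assumed, not from the paper.
    def __init__(self, region_dim=2048, question_dim=1024, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question feature
        self.score = nn.Linear(hidden_dim, 1)              # scalar informativeness per region

    def forward(self, regions, question):
        # regions: (B, N, region_dim); question: (B, question_dim)
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        return F.softmax(self.score(joint).squeeze(-1), dim=1)  # (B, N) attention weights

class MultiScaleFusion(nn.Module):
    # Hypothetical reading of "scaled combinations": summarize the top-k
    # attended regions for several values of k, then concatenate.
    def __init__(self, region_dim=2048, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.out = nn.Linear(region_dim * len(scales), region_dim)

    def forward(self, regions, alpha):
        fused = []
        for k in self.scales:
            topw, topi = alpha.topk(k, dim=1)                  # k most relevant regions
            w = F.softmax(topw, dim=1).unsqueeze(-1)           # renormalized weights (B, k, 1)
            idx = topi.unsqueeze(-1).expand(-1, -1, regions.size(-1))
            fused.append((w * regions.gather(1, idx)).sum(1))  # weighted sum, (B, region_dim)
        return self.out(torch.cat(fused, dim=-1))

# Usage with dummy tensors: 36 region features and one question vector per image.
attn = QuestionGuidedRegionAttention()
fuse = MultiScaleFusion()
regions = torch.randn(8, 36, 2048)
question = torch.randn(8, 1024)
alpha = attn(regions, question)
answer_feature = fuse(regions, alpha)  # (8, 2048), would feed an answer classifier

Under this reading, a small k isolates the single most question-relevant object, while larger k values pool context useful for relational questions; the concatenation lets the answer classifier draw on all scales at once.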
Pages: 9
Related Papers
50 records in total
  • [41] Multi-modal multi-view Bayesian semantic embedding for community question answering
    Sang, Lei
    Xu, Min
    Qian, ShengSheng
    Wu, Xindong
    [J]. NEUROCOMPUTING, 2019, 334 : 44 - 58
  • [42] Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer
    Yuan, Zhaoquan
    Peng, Xiao
    Wu, Xiao
    Xu, Changsheng
    [J]. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1313 - 1321
  • [43] TASK-ORIENTED MULTI-MODAL QUESTION ANSWERING FOR COLLABORATIVE APPLICATIONS
    Tan, Hui Li
    Leong, Mei Chee
    Xu, Qianli
    Li, Liyuan
    Fang, Fen
    Cheng, Yi
    Gauthier, Nicolas
    Sun, Ying
    Lim, Joo Hwee
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 1426 - 1430
  • [44] MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering
    Ahmad, Mobeen
    Park, Geonwoo
    Park, Dongchan
    Park, Sanguk
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 4659 - 4664
  • [45] Multi-Modal Knowledge-Aware Attention Network for Question Answering (in Chinese)
    Xu, Changsheng
    [J]. Science Press, 57: 1037 - 1045
  • [46] Multi-modal Question Answering System Driven by Domain Knowledge Graph
    Zhao, Zhengwei
    Wang, Xiaodong
    Xu, Xiaowei
    Wang, Qing
    [J]. 5TH INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING AND COMMUNICATIONS (BIGCOM 2019), 2019, : 43 - 47
  • [47] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
    Yu, Jing
    Zhang, Weifeng
    Lu, Yuhang
    Qin, Zengchang
    Hu, Yue
    Tan, Jianlong
    Wu, Qi
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (12) : 3196 - 3209
  • [48] The power of correlative microscopy: multi-modal, multi-scale, multi-dimensional
    Caplan, Jeffrey
    Niethammer, Marc
    Taylor, Russell M., II
    Czymmek, Kirk J.
    [J]. CURRENT OPINION IN STRUCTURAL BIOLOGY, 2011, 21 (05) : 686 - 693
  • [49] RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
    Yuan, Zheng
    Jin, Qiao
    Tan, Chuanqi
    Zhao, Zhengyun
    Yuan, Hongyi
    Huang, Fei
    Huang, Songfang
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 547 - 556
  • [50] Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
    Salemi, Alireza
    Rafiee, Mahta
    Zamani, Hamed
    [J]. PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023, 2023, : 169 - 176