Visual question answering with attention transfer and a cross-modal gating mechanism

Cited: 15
Authors
Li, Wei [1]
Sun, Jianhui [1]
Liu, Ge [1]
Zhao, Linglan [1]
Fang, Xiangzhong [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China
Keywords
Attention; Visual question answering; Gating
DOI
10.1016/j.patrec.2020.02.031
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual question answering (VQA) is challenging since it requires understanding both the language of the question and the corresponding visual content. Many efforts have been made to capture single-step interactions between language and vision. However, answering complex questions requires multiple steps of reasoning that gradually shift the region of interest toward the most relevant part of the given image, a process that has not been well investigated. To integrate question-related object relations into the attention mechanism, we propose a multi-step attention architecture that facilitates the modeling of multi-modal correlations. First, an attention transfer mechanism is integrated to gradually adjust the region of interest according to the reasoning representation of the question. Second, we propose a cross-modal gating strategy that filters out irrelevant information based on multi-modal correlations. Finally, we achieve state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset, which verifies the effectiveness of the proposed method. (C) 2020 Elsevier B.V. All rights reserved.
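The two components sketched in the abstract can be illustrated with a toy NumPy example. This is a minimal sketch under my own assumptions, not the paper's implementation: the cross-modal gate is modeled as a sigmoid over concatenated question and attended-visual features, and attention transfer is modeled by conditioning each step's attention on the previous step's distribution; all names, shapes, and weight initializations are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_regions, n_steps = 8, 5, 2          # toy dimensions (assumed)

q = rng.standard_normal(d)               # question embedding
V = rng.standard_normal((n_regions, d))  # image region features

# Illustrative projection weights (randomly initialized for the sketch)
Wq = rng.standard_normal((d, d)) * 0.1
Wv = rng.standard_normal((d, d)) * 0.1
Wg = rng.standard_normal((2 * d, d)) * 0.1

att = np.full(n_regions, 1.0 / n_regions)  # start from uniform attention
for step in range(n_steps):
    v_att = att @ V                        # attended visual summary
    # Cross-modal gate: sigmoid over fused question + visual features,
    # used to suppress question dimensions irrelevant to the image
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([q, v_att]) @ Wg)))
    q_gated = gate * (q @ Wq)
    # Attention transfer: new region scores conditioned on the
    # previous attention distribution (log-prior term)
    scores = (V @ Wv) @ q_gated
    att = softmax(scores + np.log(att + 1e-8))

print(att.shape)  # per-region attention after multi-step reasoning
```

Each iteration re-weights the regions, so the distribution `att` can drift from its uniform start toward regions whose features align with the gated question representation, which is the intuition behind multi-step "attention transfer" in the abstract.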
Pages: 334-340
Page count: 7
Related Papers
50 items in total
  • [1] Medical visual question answering with symmetric interaction attention and cross-modal gating
    Chen, Zhi; Zou, Beiji; Dai, Yulan; Zhu, Chengzhang; Kong, Guilan; Zhang, Wensheng
    Biomedical Signal Processing and Control, 2023, 85
  • [2] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui; Guo, Yanming; Wang, Hui; Zhang, Xin
    IEEE Access, 2018, 6: 31516-31524
  • [3] Cross-Modal Visual Question Answering for Remote Sensing Data
    Felix, Rafael; Repasky, Boris; Hodge, Samuel; Zolfaghari, Reza; Abbasnejad, Ehsan; Sherrah, Jamie
    2021 International Conference on Digital Image Computing: Techniques and Applications (DICTA 2021), 2021: 57-65
  • [4] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu; Liu, Ruifang; Peng, Bo
    2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2021), 2021: 3939-3948
  • [5] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong; Yang, Qihao; Wang, Fu Lee; Lee, Lap-Kei; Qu, Yingying; Hao, Tianyong
    Artificial Intelligence in Medicine, 2023, 144
  • [6] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul; Ferret, Olivier; Guinaudeau, Camille
    Advances in Information Retrieval, ECIR 2024, Pt I, 2024, 14608: 421-438
  • [7] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
    Yu, Jing; Zhang, Weifeng; Lu, Yuhang; Qin, Zengchang; Hu, Yue; Tan, Jianlong; Wu, Qi
    IEEE Transactions on Multimedia, 2020, 22 (12): 3196-3209
  • [8] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing; Zhu, Zihao; Wang, Yujing; Zhang, Weifeng; Hu, Yue; Tan, Jianlong
    Pattern Recognition, 2020, 108
  • [9] Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering
    Reichman, Benjamin; Heck, Larry
    2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023: 2829-2834
  • [10] Jointly Learning Attentions with Semantic Cross-Modal Correlation for Visual Question Answering
    Cao, Liangfu; Gao, Lianli; Song, Jingkuan; Xu, Xing; Shen, Heng Tao
    Databases Theory and Applications, ADC 2017, 2017, 10538: 248-260