Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering

被引:20
|
作者
Lao, Mingrui [1 ]
Guo, Yanming [1 ]
Wang, Hui [1 ]
Zhang, Xin [1 ]
机构
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410073, Hunan, Peoples R China
来源
IEEE ACCESS | 2018年 / 6卷
基金
中国国家自然科学基金;
关键词
Visual question answering; cross-modal multistep fusion network; attention mechanism;
D O I
10.1109/ACCESS.2018.2844789
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Visual question answering (VQA) is receiving increasing attention from researchers in both the computer vision and natural language processing fields. There are two key components in the VQA task: feature extraction and multi-modal fusion. For feature extraction, we introduce a novel co-attention scheme by combining Sentence-guide Word Attention (SWA) and Question-guide Image Attention in a unified framework. To be specific, the textual attention SWA relies on the semantics of the whole question sentence to calculate contributions of different question words for text representation. For the multi-modal fusion, we propose a "Cross-modal Multistep Fusion (CMF)'' network to generate multistep features and achieve multiple interactions for two modalities, rather than focusing on modeling complex interactions between two modals like most current feature fusion methods. To avoid the linear increase of the computational cost, we share the parameters for each step in the CMF. Extensive experiments demonstrate that the proposed method can achieve competitive or better performance than the state of the art.
引用
收藏
页码:31516 / 31524
页数:9
相关论文
共 50 条
  • [1] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    [J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [2] Dynamic Co-attention Network for Visual Question Answering
    Ebaid, Doaa B.
    Madbouly, Magda M.
    El-Zoghabi, Adel A.
    [J]. 2021 8TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING & MACHINE INTELLIGENCE (ISCMI 2021), 2021, : 125 - 129
  • [3] Co-attention Network for Visual Question Answering Based on Dual Attention
    Dong, Feng
    Wang, Xiaofeng
    Oad, Ammar
    Talpur, Mir Sajjad Hussain
    [J]. Journal of Engineering Science and Technology Review, 2021, 14 (06) : 116 - 123
  • [4] Co-attention graph convolutional network for visual question answering
    Liu, Chuan
    Tan, Ying-Ying
    Xia, Tian-Tian
    Zhang, Jiajing
    Zhu, Ming
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2527 - 2543
  • [5] Co-attention graph convolutional network for visual question answering
    Chuan Liu
    Ying-Ying Tan
    Tian-Tian Xia
    Jiajing Zhang
    Ming Zhu
    [J]. Multimedia Systems, 2023, 29 : 2527 - 2543
  • [6] Multi-modal co-attention relation networks for visual question answering
    Zihan Guo
    Dezhi Han
    [J]. The Visual Computer, 2023, 39 : 5783 - 5795
  • [7] Cross-modality co-attention networks for visual question answering
    Han, Dezhi
    Zhou, Shuli
    Li, Kuan Ching
    de Mello, Rodrigo Fernandes
    [J]. SOFT COMPUTING, 2021, 25 (07) : 5411 - 5421
  • [8] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu
    Liu, Ruifang
    Peng, Bo
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3939 - 3948
  • [9] Multi-modal co-attention relation networks for visual question answering
    Guo, Zihan
    Han, Dezhi
    [J]. VISUAL COMPUTER, 2023, 39 (11): : 5783 - 5795
  • [10] Multi-Channel Co-Attention Network for Visual Question Answering
    Tian, Weidong
    He, Bin
    Wang, Nanxun
    Zhao, Zhongqiu
    [J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,