A Multi-level Mesh Mutual Attention Model for Visual Question Answering

Cited by: 0
Authors
Zhi Lei
Guixian Zhang
Lijuan Wu
Kui Zhang
Rongjiao Liang
Affiliations
[1] Guangxi Normal University,Guangxi Key Lab of Multi
Keywords
Visual question answering; Multi-level; Mutual attention; Multi-head
DOI
Not available
Abstract
Visual question answering is a complex multimodal task involving images and text, with broad application prospects in human–computer interaction and medical assistance. A central challenge is how to model the feature interaction and multimodal fusion between the critical regions of the image and the keywords of the question. To this end, we propose a neural network based on the transformer encoder–decoder architecture. In the encoder, we use multi-head self-attention to mine word-to-word connections within the question features and stack multiple attention layers to obtain multi-level question features. On the decoder side, we propose a mutual attention module that exchanges information between modalities to obtain better representations of question and image features. In addition, we connect the encoder and decoder in a meshed manner, performing mutual attention with the multi-level question features and aggregating the results adaptively. In the fusion stage, we propose a multi-scale fusion module that exploits feature information at different scales to complete modal fusion. We validate the effectiveness of the model on the VQA v1 and VQA v2 datasets, where it achieves better results than state-of-the-art methods.
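To make the meshed mutual-attention idea concrete, the sketch below shows one plausible PyTorch realization based only on the abstract's description: image regions and question words attend to each other in a mutual attention block, the block is applied once per encoder level, and the per-level outputs are aggregated with learned softmax weights. All class names, hyper-parameters, and the residual/aggregation details (MutualAttention, MeshedMutualDecoder, num_levels, and so on) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the mutual-attention / meshed-aggregation idea from the
# abstract. Names and design details are assumptions for illustration only.
import torch
import torch.nn as nn


class MutualAttention(nn.Module):
    """Cross-modal attention: each modality queries the other."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # Image regions attend to question words, and question words attend to regions.
        img_updated, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        txt_updated, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        return img_feats + img_updated, txt_feats + txt_updated


class MeshedMutualDecoder(nn.Module):
    """Applies mutual attention against every encoder level and adaptively
    aggregates the per-level image outputs with learned softmax weights."""

    def __init__(self, dim: int, num_levels: int, num_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            MutualAttention(dim, num_heads) for _ in range(num_levels)
        )
        self.level_logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, img_feats, question_levels):
        # question_levels: list of [batch, num_words, dim], one per encoder layer.
        weights = torch.softmax(self.level_logits, dim=0)
        img_agg = 0.0
        for w, block, q in zip(weights, self.blocks, question_levels):
            img_level, _ = block(img_feats, q)
            img_agg = img_agg + w * img_level
        return img_agg


if __name__ == "__main__":
    dim, levels = 512, 3
    img = torch.randn(2, 36, dim)                          # 36 region features per image
    qs = [torch.randn(2, 14, dim) for _ in range(levels)]  # multi-level question features
    decoder = MeshedMutualDecoder(dim, num_levels=levels)
    print(decoder(img, qs).shape)  # torch.Size([2, 36, 512])
```

The learned level weights stand in for the paper's adaptive aggregation; the actual model presumably also includes feed-forward sublayers, normalization, and the multi-scale fusion module, which are omitted here.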
Pages: 339–353
Number of pages: 14
Related Papers
50 items in total
  • [41] MFM: A Multi-level Fused Sequence Matching Model for Candidates Filtering in Multi-paragraphs Question-Answering
    Liu, Yang
    Huang, Zhen
    Hu, Minghao
    Du, Shuyang
    Peng, Yuxing
    Li, Dongsheng
    Wang, Xu
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT III, 2018, 11166 : 449 - 458
  • [42] A Multi-level Attention Model for Text Matching
    Sun, Qiang
    Wu, Yue
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2018, PT I, 2018, 11139 : 142 - 153
  • [43] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [44] Multi-Question Learning for Visual Question Answering
    Lei, Chenyi
    Wu, Lei
    Liu, Dong
    Li, Zhao
    Wang, Guoxin
    Tang, Haihong
    Li, Houqiang
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11328 - 11335
  • [45] Multi-Modal fusion with multi-level attention for Visual Dialog
    Zhang, Jingping
    Wang, Qiang
    Han, Yahong
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (04)
  • [46] Visual question answering model based on graph neural network and contextual attention
    Sharma, Himanshu
    Jalal, Anand Singh
    [J]. IMAGE AND VISION COMPUTING, 2021, 110
  • [47] Generative Attention Model with Adversarial Self-learning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    [J]. PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017, : 415 - 423
  • [48] Video quality enhancement based on visual attention model and multi-level exposure correction
    Guo-Shiang Lin
    Xian-Wei Ji
    [J]. Multimedia Tools and Applications, 2016, 75 : 9903 - 9925
  • [49] Video quality enhancement based on visual attention model and multi-level exposure correction
    Lin, Guo-Shiang
    Ji, Xian-Wei
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2016, 75 (16) : 9903 - 9925
  • [50] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    [J]. DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673