Multimodal Encoder-Decoder Attention Networks for Visual Question Answering

Cited by: 40
Authors
Chen, Chongqing [1 ]
Han, Dezhi [1 ]
Wang, Jun [2 ]
Affiliations
[1] Shanghai Maritime Univ, Sch Informat Engn, Shanghai 201306, Peoples R China
[2] Univ Cent Florida, Dept ECE, Orlando, FL 32816 USA
Source
IEEE ACCESS | 2020, Vol. 8
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Visualization; Decoding; Task analysis; Knowledge discovery; Natural language processing; Benchmark testing; Computer vision; encoder-decoder attention; multimodal task; natural language processing; question-guided-attention; self-attention; visual question answering;
DOI
10.1109/ACCESS.2020.2975093
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline code
0812 ;
Abstract
Visual Question Answering (VQA) is a multimodal task involving Computer Vision (CV) and Natural Language Processing (NLP); the goal is to establish a high-efficiency VQA model. Learning a fine-grained, simultaneous understanding of both the visual content of images and the textual content of questions is at the heart of VQA. This paper proposes novel Multimodal Encoder-Decoder Attention Networks (MEDAN). MEDAN consists of Multimodal Encoder-Decoder Attention (MEDA) layers cascaded in depth, and captures rich, well-grounded question and image features by associating keywords in the question with important object regions in the image. Each MEDA layer contains an Encoder module that models the self-attention of questions, and a Decoder module that models both the question-guided attention and the self-attention of images. Experimental results on the benchmark VQA-v2 dataset demonstrate that MEDAN achieves state-of-the-art VQA performance. With the Adam solver, our best single model delivers 71.01% overall accuracy on the test-std set; with the AdamW solver, we achieve 70.76% overall accuracy on the test-dev set. Additionally, extensive ablation studies are conducted to explore the reasons for MEDAN's effectiveness.
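To make the encoder-decoder attention flow described in the abstract concrete, below is a minimal pure-Python sketch of what one MEDA layer might compute: an encoder self-attention pass over question features, then a decoder pass that applies image self-attention followed by question-guided attention. This is an illustrative simplification, not the paper's implementation: learned projections, multi-head splitting, feed-forward sublayers, residual connections, and layer normalization are all omitted, the function names are hypothetical, and question and image features are assumed to share a common dimension.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors.

    Each query attends to all keys; the output for a query is the
    softmax-weighted average of the value vectors.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = softmax(
            [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        )
        out.append(
            [sum(w * v[j] for w, v in zip(scores, values))
             for j in range(len(values[0]))]
        )
    return out

def meda_layer(question_feats, image_feats):
    """One simplified MEDA layer (hypothetical sketch, no learned weights).

    Encoder: self-attention over question features.
    Decoder: self-attention over image features, then question-guided
    attention where image features query the encoded question.
    """
    q_enc = attention(question_feats, question_feats, question_feats)
    v_self = attention(image_feats, image_feats, image_feats)
    v_out = attention(v_self, q_enc, q_enc)  # question-guided attention
    return q_enc, v_out
```

In the full model, such layers would be cascaded in depth, with the output of one layer feeding the next, so that deeper layers can associate question keywords with increasingly relevant image regions.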
Pages: 35662-35671
Page count: 10
Related papers
50 records
  • [1] Attention-based encoder-decoder model for answer selection in question answering
    Nie, Yuan-ping
    Han, Yi
    Huang, Jiu-ming
    Jiao, Bo
    Li, Ai-ping
    [J]. FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2017, 18 (04) : 535 - 544
  • [3] Encoder-decoder cycle for visual question answering based on perception-action cycle
    Mohamud, Safaa Abdullahi Moallim
    Jalali, Amin
    Lee, Minho
    [J]. PATTERN RECOGNITION, 2023, 144
  • [4] Timber Tracing with Multimodal Encoder-Decoder Networks
    Zolotarev, Fedor
    Eerola, Tuomas
    Lensu, Lasse
    Kalviainen, Heikki
    Haario, Heikki
    Heikkinen, Jere
    Kauppi, Tomi
    [J]. COMPUTER ANALYSIS OF IMAGES AND PATTERNS, CAIP 2019, PT II, 2019, 11679 : 342 - 353
  • [5] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    [J]. INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792
  • [6] Multimodal Cross-guided Attention Networks for Visual Question Answering
    Liu, Haibin
    Gong, Shengrong
    Ji, Yi
    Yang, Jianyu
    Xing, Tengfei
    Liu, Chunping
    [J]. PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON COMPUTER MODELING, SIMULATION AND ALGORITHM (CMSA 2018), 2018, 151 : 347 - 353
  • [7] Deep Convolutional Symmetric Encoder-Decoder Neural Networks to Predict Students' Visual Attention
    Hachaj, Tomasz
    Stolinska, Anna
    Andrzejewska, Magdalena
    Czerski, Piotr
    [J]. SYMMETRY-BASEL, 2021, 13 (12):
  • [8] Attention-based encoder-decoder networks for workflow recognition
    Zhang, Min
    Hu, Haiyang
    Li, Zhongjin
    Chen, Jie
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (28-29) : 34973 - 34995
  • [9] Video Summarization With Attention-Based Encoder-Decoder Networks
    Ji, Zhong
    Xiong, Kailin
    Pang, Yanwei
    Li, Xuelong
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (06) : 1709 - 1717