A Multi-level Mesh Mutual Attention Model for Visual Question Answering

Cited by: 0
Authors
Zhi Lei
Guixian Zhang
Lijuan Wu
Kui Zhang
Rongjiao Liang
Affiliation
[1] Guangxi Normal University, Guangxi Key Lab of Multi
Source
Data Science and Engineering, 2022, 7(04): 339-353
Keywords
Visual question answering; Multi-level; Mutual attention; Multi-head
DOI
Not available
CLC number
Subject classification number
Abstract
Visual question answering is a complex multimodal task involving images and text, with broad application prospects in human–computer interaction and medical assistance. A central issue is how to model the feature interaction and multimodal fusion between critical regions in the image and keywords in the question. To this end, we propose a neural network based on the transformer encoder–decoder structure. In the encoder, multi-head self-attention mines word-to-word connections within the question features, and multiple attention layers are stacked to obtain multi-level question features. On the decoder side, we propose a mutual attention module that exchanges information between modalities to obtain better representations of question and image features. In addition, we connect the encoder and decoder in a meshed manner, performing mutual attention with the multi-level question features and aggregating the information adaptively. In the fusion stage, we propose a multi-scale fusion module that exploits feature information at different scales to complete the modal fusion. We validate the model on the VQA v1 and VQA v2 datasets, where it achieves better results than state-of-the-art methods.
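To make the meshed connection and adaptive aggregation described in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a decoder component cross-attends image region features to the question features from every encoder level and combines the per-level outputs with learned gates. All names (MeshMutualAttention, gate, alpha) and dimensions are invented placeholders for illustration.

# Illustrative sketch of mesh mutual attention over multi-level question features.
# Assumes PyTorch >= 1.9 (batch_first MultiheadAttention); names are hypothetical.
import torch
import torch.nn as nn

class MeshMutualAttention(nn.Module):
    def __init__(self, dim=512, heads=8, num_levels=3):
        super().__init__()
        # one mutual (cross) attention per encoder level
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(num_levels)
        )
        # adaptive gates that weight each level's contribution
        self.gate = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(num_levels)
        )

    def forward(self, img_feats, question_levels):
        # img_feats: (B, R, D) image region features
        # question_levels: list of (B, T, D) tensors, one per encoder level
        out = 0.0
        for attn, gate, q in zip(self.cross_attn, self.gate, question_levels):
            # image queries attend to question keys/values (mutual attention step)
            level_out, _ = attn(img_feats, q, q)
            # adaptive per-level weight from image features and the attended output
            alpha = torch.sigmoid(gate(torch.cat([img_feats, level_out], dim=-1)))
            out = out + alpha * level_out  # meshed, weighted aggregation
        return out / len(question_levels)

if __name__ == "__main__":
    B, R, T, D = 2, 36, 14, 512
    layer = MeshMutualAttention(dim=D, num_levels=3)
    img = torch.randn(B, R, D)
    qs = [torch.randn(B, T, D) for _ in range(3)]
    print(layer(img, qs).shape)  # torch.Size([2, 36, 512])

In this sketch the gating plays the role of the adaptive aggregation mentioned in the abstract; the paper's actual fusion and multi-scale modules may differ in detail.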
Pages: 339-353
Page count: 14