Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism

Times Cited: 4
Authors
Xia, Qihao [1 ]
Yu, Chao [1 ,2 ,3 ]
Hou, Yinong [1 ]
Peng, Pingping [1 ]
Zheng, Zhengqi [1 ,2 ]
Chen, Wen [1 ,2 ,3 ]
Affiliations
[1] East China Normal Univ, Engn Ctr SHMEC Space Informat & GNSS, Shanghai 200241, Peoples R China
[2] East China Normal Univ, Shanghai Key Lab Multidimens Informat Proc, Shanghai 200241, Peoples R China
[3] East China Normal Univ, Key Lab Geog Informat Sci, Minist Educ, Shanghai 200241, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
multi-modal alignment; multi-hop attention; visual question answering; feature fusion; SIGMOID FUNCTION; MODEL;
DOI
10.3390/electronics11111778
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
The alignment of information between the image and the question is of great significance in the visual question answering (VQA) task. Self-attention is commonly used to generate attention weights between the image and the question; these weights align the two modalities by letting the model select the image regions relevant to the question. However, in a standard self-attention mechanism, the attention weight between two objects is determined solely by the representations of those two objects, ignoring the influence of the surrounding objects. This paper proposes a novel multi-hop attention alignment method that enriches the representation with surrounding information when self-attention is used to align the two modalities. To further exploit positional information during alignment, we also propose a position embedding mechanism that extracts the position of each object and aligns each question word with the correct location in the image. On the VQA2.0 dataset, our model achieves a validation accuracy of 65.77%, outperforming several state-of-the-art methods and demonstrating the effectiveness of the proposed approach.
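The abstract describes two components: multi-hop attention for cross-modal alignment and a position embedding for image regions. The sketch below is a minimal illustration of one plausible reading of that description, not the authors' implementation; the function names (multi_hop_alignment, embed_positions), the query-refinement update between hops, and the bounding-box projection are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): each attention hop refines the
# question-word queries with the image context gathered in the previous hop, so the
# alignment weights are no longer determined by a single pair of object
# representations alone.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def embed_positions(region_feats, boxes, rng=np.random.default_rng(0)):
    """Add a projection of normalized bounding boxes (x1, y1, x2, y2 in [0, 1])
    to each image-region feature; a simple stand-in for position embedding."""
    d = region_feats.shape[-1]
    w = rng.standard_normal((4, d)) / np.sqrt(4)  # assumed learnable in the real model
    return region_feats + boxes @ w


def multi_hop_alignment(region_feats, word_feats, num_hops=2):
    """Align question words to image regions over several attention hops."""
    d = region_feats.shape[-1]
    queries = word_feats
    weights = None
    for _ in range(num_hops):
        scores = queries @ region_feats.T / np.sqrt(d)  # (num_words, num_regions)
        weights = softmax(scores, axis=-1)              # cross-modal alignment weights
        context = weights @ region_feats                # attended image context per word
        queries = queries + context                     # surrounding objects shape the next hop
    return weights


# Toy usage: align 5 question words to 8 image regions of dimension 16.
rng = np.random.default_rng(1)
regions = embed_positions(rng.standard_normal((8, 16)), rng.random((8, 4)))
words = rng.standard_normal((5, 16))
print(multi_hop_alignment(regions, words).shape)  # (5, 8)
```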
Pages: 14
Related Papers
50 records in total
  • [1] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (09) : 3894 - 3908
  • [2] Multi-modal adaptive gated mechanism for visual question answering
    Xu, Yangshuyi
    Zhang, Lin
    Shen, Xiang
    [J]. PLOS ONE, 2023, 18 (06):
  • [3] Multi-modal spatial relational attention networks for visual question answering
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Sun, Yuxin
    Zhang, Zhi
    Luo, Yongkang
    [J]. IMAGE AND VISION COMPUTING, 2023, 140
  • [4] The multi-modal fusion in visual question answering: a review of attention mechanisms
    Lu, Siyu
    Liu, Mingzhe
    Yin, Lirong
    Yin, Zhengtong
    Liu, Xuan
    Zheng, Wenfeng
    [J]. PEERJ COMPUTER SCIENCE, 2023, 9
  • [5] Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering
    Guo, Zihan
    Han, Dezhi
    [J]. SENSORS, 2020, 20 (23) : 1 - 15
  • [6] Multi-hop Question Answering
    Mavi, Vaibhav
    Jangra, Anubhav
    Jatowt, Adam
    [J]. FOUNDATIONS AND TRENDS IN INFORMATION RETRIEVAL, 2023, 17 (05): 457 - 586
  • [7] Discovering Multimodal Hierarchical Structures with Graph Neural Networks for Multi-modal and Multi-hop Question Answering
    Zhang, Qing
    Lv, Haocheng
    Liu, Jie
    Chen, Zhiyun
    Duan, Jianyong
    Xv, Mingying
    Wang, Hao
    [J]. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I, 2024, 14425 : 383 - 394
  • [8] Multi-modal co-attention relation networks for visual question answering
    Guo, Zihan
    Han, Dezhi
    [J]. VISUAL COMPUTER, 2023, 39 (11): 5783 - 5795
  • [9] Enhancing Multi-modal Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation
    Yang, Qian
    Chen, Qian
    Wang, Wen
    Hu, Baotian
    Zhang, Min
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5223 - 5234