DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks

Cited by: 1
Authors
Liu, Xuetao [1 ]
Dong, Ruiliang [1 ]
Yang, Hongyan [1 ]
Affiliations
[1] Beijing Univ Technol, Sch Informat Sci & Technol, Beijing 100021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Feature extraction; Visualization; Attention mechanisms; Imaging; Question answering (information retrieval); Informatics; Context modeling; Vectors; Semantics; Combined attention; hypergraph neural networks (HGNNs); visual question answering (VQA);
DOI
10.1109/TII.2024.3453919
CLC number
TP [automation technology, computer technology];
Subject classification
0812;
Abstract
Amidst the burgeoning advancements in deep learning, traditional neural networks have demonstrated significant achievements in unimodal tasks such as image recognition. However, the handling of multimodal data, especially in visual question answering (VQA) tasks, presents challenges in processing the complex structural relationships among modalities. To address this issue, this article introduces a dynamic heterogeneous hypergraph neural network (HGNN) model that utilizes a Transformer-based combined attention mechanism and designs a hypergraph representation imaging network to enhance model inference without increasing parameter count. Initially, image scenes and textual questions are converted into pairs of hypergraphs with preliminary weights, which facilitate the capture of complex structural relationships through the HGNN. The hypergraph representation imaging network further aids the HGNN in learning and understanding the scene image modalities. Subsequently, a Transformer-based combined attention mechanism is employed to adapt to the distinct characteristics of each modality and their intermodal interactions. This integration of multiple attention mechanisms helps identify critical structural information within the answer regions. Dynamic updates to the hyperedge weights of the hypergraph pairs, guided by the attention weights, enable the model to assimilate more relevant information progressively. Experiments on two public VQA datasets attest to the model's superior performance. Furthermore, this article envisions future advancements in model optimization and feature information extraction, extending the potential of HGNNs in multimodal fusion technology.
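The abstract's pipeline (hypergraph convolution over weighted hyperedges, followed by attention-guided dynamic updates to those weights) can be sketched minimally as follows. This is not the paper's implementation: the convolution uses the standard spectral HGNN formulation X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Θ, and the `update_hyperedge_weights` interpolation rule is a hypothetical stand-in for the attention-guided update the paper describes.

```python
import numpy as np

def hgnn_layer(X, H, w, Theta):
    """One standard hypergraph convolution:
    X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta,
    where H is the (nodes x hyperedges) incidence matrix and
    w holds per-hyperedge weights."""
    W = np.diag(w)
    dv = (H * w).sum(axis=1)                  # weighted vertex degrees
    de = H.sum(axis=0)                        # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    De_inv = np.diag(1.0 / de)
    return Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt @ X @ Theta

def update_hyperedge_weights(w, attn, lr=0.5):
    """Hypothetical dynamic update: interpolate hyperedge weights
    toward normalized attention scores, so hyperedges that attract
    more attention gain weight on the next pass."""
    attn = attn / attn.sum()
    return (1 - lr) * w + lr * attn * len(w)
```

In the model described above, `attn` would come from the combined attention mechanism over answer regions; here it is just a placeholder vector, and `lr` controls how aggressively the hypergraph adapts between passes.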
Pages: 545-553 (9 pages)