DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks

Cited by: 1
Authors
Liu, Xuetao [1 ]
Dong, Ruiliang [1 ]
Yang, Hongyan [1 ]
Affiliations
[1] Beijing Univ Technol, Sch Informat Sci & Technol, Beijing 100021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Feature extraction; Visualization; Attention mechanisms; Imaging; Question answering (information retrieval); Informatics; Context modeling; Vectors; Semantics; Combined attention; hypergraph neural networks (HGNNs); visual question answering (VQA);
DOI
10.1109/TII.2024.3453919
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Amidst the burgeoning advancements in deep learning, traditional neural networks have demonstrated significant achievements in unimodal tasks such as image recognition. However, the handling of multimodal data, especially in visual question answering (VQA) tasks, presents challenges in processing the complex structural relationships among modalities. To address this issue, this article introduces a dynamic heterogeneous hypergraph neural network (HGNN) model that utilizes a Transformer-based combined attention mechanism and designs a hypergraph representation imaging network to enhance model inference without increasing the parameter count. Initially, image scenes and textual questions are converted into pairs of hypergraphs with preliminary weights, which facilitate the capture of complex structural relationships through the HGNN. The hypergraph representation imaging network further aids the HGNN in learning and understanding the scene image modalities. Subsequently, a Transformer-based combined attention mechanism is employed to adapt to the distinct characteristics of each modality and their intermodal interactions. This integration of multiple attention mechanisms helps identify critical structural information within the answer regions. Dynamic updates to the hyperedge weights of the hypergraph pairs, guided by the attention weights, enable the model to assimilate more relevant information progressively. Experiments on two public VQA datasets attest to the model's superior performance. Furthermore, this article envisions future advancements in model optimization and feature information extraction, extending the potential of HGNNs in multimodal fusion technology.
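The abstract describes the architecture only at a high level. As an illustration of the two named ingredients, the sketch below shows a standard hypergraph convolution layer (in the usual form X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta) together with an attention-guided update of the hyperedge weights W. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; the names HypergraphConv, update_edge_weights, and momentum, as well as the specific pooling rule, are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HypergraphConv(nn.Module):
    """Standard hypergraph convolution:
    X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, H, edge_w):
        # x: (N, in_dim) node features, H: (N, E) incidence matrix,
        # edge_w: (E,) hyperedge weights, updated dynamically between passes.
        W = torch.diag(edge_w)
        Dv = (H * edge_w).sum(dim=1).clamp(min=1e-6)   # weighted node degrees
        De = H.sum(dim=0).clamp(min=1e-6)              # hyperedge degrees
        Dv_inv_sqrt = torch.diag(Dv.pow(-0.5))
        De_inv = torch.diag(De.pow(-1.0))
        A = Dv_inv_sqrt @ H @ W @ De_inv @ H.t() @ Dv_inv_sqrt
        return F.relu(A @ self.theta(x))


def update_edge_weights(edge_w, node_attn, H, momentum=0.7):
    """Hypothetical update rule: pool node-level attention scores into the
    hyperedges containing those nodes, then blend with the old weights."""
    edge_attn = (H.t() @ node_attn) / H.sum(dim=0).clamp(min=1e-6)
    edge_attn = edge_attn / edge_attn.max().clamp(min=1e-6)   # rescale to [0, 1]
    return momentum * edge_w + (1.0 - momentum) * edge_attn


if __name__ == "__main__":
    torch.manual_seed(0)
    N, E, D = 8, 4, 16                       # nodes, hyperedges, feature dim
    H = (torch.rand(N, E) > 0.5).float()     # toy incidence matrix
    x = torch.randn(N, D)                    # toy node features (regions or words)
    edge_w = torch.ones(E)                   # preliminary hyperedge weights
    conv = HypergraphConv(D, D)
    x = conv(x, H, edge_w)
    node_attn = torch.softmax(x.mean(dim=1), dim=0)   # stand-in attention weights
    edge_w = update_edge_weights(edge_w, node_attn, H)
    print(edge_w)
```

In the paper, the weights guiding the update would come from the Transformer-based combined attention over the image and question modalities; the softmax over pooled node features above is only a stand-in so the example runs end to end.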
Pages: 545-553
Page count: 9