DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks

Cited by: 1
Authors
Liu, Xuetao [1]
Dong, Ruiliang [1]
Yang, Hongyan [1]
Affiliations
[1] Beijing University of Technology, School of Information Science and Technology, Beijing 100021, China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Feature extraction; Visualization; Attention mechanisms; Imaging; Question answering (information retrieval); Informatics; Context modeling; Vectors; Semantics; Combined attention; hypergraph neural networks (HGNNs); visual question answering (VQA)
DOI
10.1109/TII.2024.3453919
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Amidst the burgeoning advancements in deep learning, traditional neural networks have demonstrated significant achievements in unimodal tasks such as image recognition. However, the handling of multimodal data, especially in visual question answering (VQA) tasks, presents challenges in processing the complex structural relationships among modalities. To address this issue, this article introduces a dynamic heterogeneous hypergraph neural network (HGNN) model that utilizes a Transformer-based combined attention mechanism and designs a hypergraph representation imaging network to enhance model inference without increasing the parameter count. Initially, image scenes and textual questions are converted into pairs of hypergraphs with preliminary weights, which facilitate the capture of complex structural relationships through the HGNN. The hypergraph representation imaging network further aids the HGNN in learning and understanding the scene image modalities. Subsequently, a Transformer-based combined attention mechanism is employed to adapt to the distinct characteristics of each modality and their intermodal interactions. This integration of multiple attention mechanisms helps identify critical structural information within the answer regions. Dynamic updates to the hyperedge weights of the hypergraph pairs, guided by the attention weights, enable the model to assimilate more relevant information progressively. Experiments on two public VQA datasets attest to the model's superior performance. Furthermore, this article envisions future advancements in model optimization and feature information extraction, extending the potential of HGNNs in multimodal fusion technology.
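The record itself contains no code, so the following is only a minimal PyTorch sketch of the two generic building blocks the abstract alludes to: a standard hypergraph convolution of the form X' = sigma(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta), and an attention-guided dynamic re-weighting of hyperedges. All names, tensor shapes, the per-node attention vector, and the momentum-style blending are illustrative assumptions, not the authors' DHHG-TAC implementation.

```python
# Minimal sketch (assumptions, not the paper's code): one hypergraph convolution layer
# plus a dynamic hyperedge re-weighting step driven by attention weights.
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """Standard HGNN layer: X' = relu(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)  # node projection Theta

    def forward(self, x: torch.Tensor, H: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features, H: (N, E) incidence matrix, w: (E,) hyperedge weights
        W = torch.diag(w)
        Dv = torch.diag((H @ w).clamp(min=1e-6).pow(-0.5))       # weighted node degrees, D_v^{-1/2}
        De = torch.diag(H.sum(dim=0).clamp(min=1e-6).pow(-1.0))  # hyperedge degrees, D_e^{-1}
        return torch.relu(Dv @ H @ W @ De @ H.t() @ Dv @ self.theta(x))

def update_hyperedge_weights(w: torch.Tensor, attn: torch.Tensor, H: torch.Tensor,
                             momentum: float = 0.9) -> torch.Tensor:
    # attn: (N,) per-node attention mass taken from the Transformer attention maps
    # (an assumed interface). Average node attention onto each hyperedge, then blend
    # it into the old weights; the momentum blending is an assumed stabilizing choice.
    edge_attn = (H.t() @ attn) / H.sum(dim=0).clamp(min=1e-6)
    new_w = momentum * w + (1.0 - momentum) * edge_attn
    return new_w / new_w.sum() * w.sum()  # keep the total weight mass comparable

# Toy usage: 6 nodes (e.g., detected regions or question tokens), 3 hyperedges.
N, E, d = 6, 3, 16
H = (torch.rand(N, E) > 0.5).float()
H[0, 0] = 1.0                              # keep the toy incidence matrix non-degenerate
x, w = torch.randn(N, d), torch.ones(E)
layer = HypergraphConv(d, d)
x = layer(x, H, w)                          # hypergraph message passing
attn = torch.softmax(torch.randn(N), dim=0)
w = update_hyperedge_weights(w, attn, H)    # attention-guided dynamic re-weighting
```

In this reading, one such convolution would run per modality hypergraph, and the re-weighting step would be applied after each attention pass so that subsequent hypergraph layers emphasize hyperedges covering the attended answer regions; how DHHG-TAC actually couples the two is detailed only in the full paper.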
Pages: 545-553
Page count: 9