DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks

Cited by: 1
Authors
Liu, Xuetao [1]
Dong, Ruiliang [1]
Yang, Hongyan [1]
Affiliations
[1] Beijing University of Technology, School of Information Science and Technology, Beijing 100021, China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Feature extraction; Visualization; Attention mechanisms; Imaging; Question answering (information retrieval); Informatics; Context modeling; Vectors; Semantics; Combined attention; hypergraph neural networks (HGNNs); visual question answering (VQA)
DOI
10.1109/TII.2024.3453919
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Amidst the burgeoning advancements in deep learning, traditional neural networks have demonstrated significant achievements in unimodal tasks such as image recognition. However, the handling of multimodal data, especially in visual question answering (VQA) tasks, presents challenges in processing the complex structural relationships among modalities. To address this issue, this article introduces a dynamic heterogeneous hypergraph neural network (HGNN) model that utilizes a Transformer-based combined attention mechanism and designs a hypergraph representation imaging network to enhance model inference without increasing the parameter count. Initially, image scenes and textual questions are converted into pairs of hypergraphs with preliminary weights, which facilitate the capture of complex structural relationships through the HGNN. The hypergraph representation imaging network further aids the HGNN in learning and understanding the scene image modalities. Subsequently, a Transformer-based combined attention mechanism is employed to adapt to the distinct characteristics of each modality and their intermodal interactions. This integration of multiple attention mechanisms helps identify critical structural information within the answer regions. Dynamic updates to the hyperedge weights of the hypergraph pairs, guided by the attention weights, enable the model to assimilate more relevant information progressively. Experiments on two public VQA datasets attest to the model's superior performance. Furthermore, this article envisions future advancements in model optimization and feature information extraction, extending the potential of HGNNs in multimodal fusion technology.
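The record itself contains no code, so the following is only a minimal PyTorch sketch of the two generic building blocks the abstract alludes to: a standard hypergraph convolution of the form X' = sigma(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta), and an attention-guided dynamic re-weighting of hyperedges. All names, tensor shapes, the per-node attention vector, and the momentum-style blending are illustrative assumptions, not the authors' DHHG-TAC implementation.

```python
# Minimal sketch (assumptions, not the paper's code): one hypergraph convolution layer
# plus a dynamic hyperedge re-weighting step driven by attention weights.
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """Standard HGNN layer: X' = relu(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)  # node projection Theta

    def forward(self, x: torch.Tensor, H: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features, H: (N, E) incidence matrix, w: (E,) hyperedge weights
        W = torch.diag(w)
        Dv = torch.diag((H @ w).clamp(min=1e-6).pow(-0.5))       # weighted node degrees, D_v^{-1/2}
        De = torch.diag(H.sum(dim=0).clamp(min=1e-6).pow(-1.0))  # hyperedge degrees, D_e^{-1}
        return torch.relu(Dv @ H @ W @ De @ H.t() @ Dv @ self.theta(x))

def update_hyperedge_weights(w: torch.Tensor, attn: torch.Tensor, H: torch.Tensor,
                             momentum: float = 0.9) -> torch.Tensor:
    # attn: (N,) per-node attention mass taken from the Transformer attention maps
    # (an assumed interface). Average node attention onto each hyperedge, then blend
    # it into the old weights; the momentum blending is an assumed stabilizing choice.
    edge_attn = (H.t() @ attn) / H.sum(dim=0).clamp(min=1e-6)
    new_w = momentum * w + (1.0 - momentum) * edge_attn
    return new_w / new_w.sum() * w.sum()  # keep the total weight mass comparable

# Toy usage: 6 nodes (e.g., detected regions or question tokens), 3 hyperedges.
N, E, d = 6, 3, 16
H = (torch.rand(N, E) > 0.5).float()
H[0, 0] = 1.0                              # keep the toy incidence matrix non-degenerate
x, w = torch.randn(N, d), torch.ones(E)
layer = HypergraphConv(d, d)
x = layer(x, H, w)                          # hypergraph message passing
attn = torch.softmax(torch.randn(N), dim=0)
w = update_hyperedge_weights(w, attn, H)    # attention-guided dynamic re-weighting
```

In this reading, one such convolution would run per modality hypergraph, and the re-weighting step would be applied after each attention pass so that subsequent hypergraph layers emphasize hyperedges covering the attended answer regions; how DHHG-TAC actually couples the two is detailed only in the full paper.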
Pages: 545-553
Page count: 9