DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks

Cited by: 1
Authors
Liu, Xuetao [1 ]
Dong, Ruiliang [1 ]
Yang, Hongyan [1 ]
Affiliations
[1] Beijing Univ Technol, Sch Informat Sci & Technol, Beijing 100021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Feature extraction; Visualization; Attention mechanisms; Imaging; Question answering (information retrieval); Informatics; Context modeling; Vectors; Semantics; Combined attention; hypergraph neural networks (HGNNs); visual question answering (VQA);
DOI
10.1109/TII.2024.3453919
CLC number
TP [automation technology, computer technology];
Subject classification
0812;
Abstract
Amidst the burgeoning advancements in deep learning, traditional neural networks have demonstrated significant achievements in unimodal tasks such as image recognition. However, the handling of multimodal data, especially in visual question answering (VQA) tasks, presents challenges in processing the complex structural relationships among modalities. To address this issue, this article introduces a dynamic heterogeneous hypergraph neural network (HGNN) model that utilizes a Transformer-based combined attention mechanism and designs a hypergraph representation imaging network to enhance model inference without increasing parameter count. Initially, image scenes and textual questions are converted into pairs of hypergraphs with preliminary weights, which facilitate the capture of complex structural relationships through the HGNN. The hypergraph representation imaging network further aids the HGNN in learning and understanding the scene image modalities. Subsequently, a Transformer-based combined attention mechanism is employed to adapt to the distinct characteristics of each modality and their intermodal interactions. This integration of multiple attention mechanisms helps identify critical structural information within the answer regions. Dynamic updates to the hyperedge weights of the hypergraph pairs, guided by the attention weights, enable the model to assimilate more relevant information progressively. Experiments on two public VQA datasets attest to the model's superior performance. Furthermore, this article envisions future advancements in model optimization and feature information extraction, extending the potential of HGNNs in multimodal fusion technology.
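The abstract's pipeline (hypergraph convolution over weighted hyperedges, followed by attention-guided dynamic updates to those weights) can be sketched minimally as follows. This is not the paper's implementation: the convolution uses the standard spectral HGNN formulation X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Θ, and the `update_hyperedge_weights` interpolation rule is a hypothetical stand-in for the attention-guided update the paper describes.

```python
import numpy as np

def hgnn_layer(X, H, w, Theta):
    """One standard hypergraph convolution:
    X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta,
    where H is the (nodes x hyperedges) incidence matrix and
    w holds per-hyperedge weights."""
    W = np.diag(w)
    dv = (H * w).sum(axis=1)                  # weighted vertex degrees
    de = H.sum(axis=0)                        # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    De_inv = np.diag(1.0 / de)
    return Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt @ X @ Theta

def update_hyperedge_weights(w, attn, lr=0.5):
    """Hypothetical dynamic update: interpolate hyperedge weights
    toward normalized attention scores, so hyperedges that attract
    more attention gain weight on the next pass."""
    attn = attn / attn.sum()
    return (1 - lr) * w + lr * attn * len(w)
```

In the model described above, `attn` would come from the combined attention mechanism over answer regions; here it is just a placeholder vector, and `lr` controls how aggressively the hypergraph adapts between passes.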
Pages: 545-553 (9 pages)