Multimodal Bi-direction Guided Attention Networks for Visual Question Answering

Cited by: 0

Authors:

Cai, Linqin [1]

Xu, Nuoying [1]

Tian, Hang [1]

Chen, Kejia [2]

Fan, Haodu [1]

Affiliations:

[1] Chongqing Univ Posts & Telecommun, Res Ctr Artificial Intelligence & Smart Educ, Chongqing 400065, Peoples R China

[2] Chengdu Huawei Technol Co Ltd, Chengdu 500643, Peoples R China

Source:

NEURAL PROCESSING LETTERS | 2023 / Volume 55 / Issue 09

Funding:

National Natural Science Foundation of China;

Keywords:

Visual question answering; Attention mechanism; Position attention; Deep learning; FUSION; KNOWLEDGE;

DOI:

10.1007/s11063-023-11403-0

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual question answering (VQA) has become a research hotspot in computer vision and natural language processing. A core problem in VQA is how to fuse multimodal features from images and questions. This paper proposes a Multimodal Bi-direction Guided Attention Network (MBGAN) for VQA that combines visual relationships and attention to achieve more refined feature fusion. Specifically, self-attention is used to extract image and text features, and guided attention is applied to obtain the correlation between each image region and the related question. To capture the relative positions of different objects, position attention is further introduced to model the relationships between objects and to improve the matching of multimodal features. Given an image and a natural language question, MBGAN learns visual relation inference and question attention networks in parallel to achieve fine-grained fusion of visual and textual features; the final answer is then obtained through model stacking. MBGAN achieves 69.41% overall accuracy on the VQA-v1 dataset, 70.79% on the VQA-v2 dataset, and 68.79% on the COCO-QA dataset, which shows that it outperforms most state-of-the-art models.
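
The self-attention and guided-attention steps named in the abstract can be illustrated with plain scaled dot-product attention. The following is a minimal sketch and not the authors' implementation: it assumes no learned projections and a single head, it omits position attention and model stacking, and the feature values, the names `image_regions` and `question_words`, and the way the two guidance directions are wired are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Returns (outputs, weights), where weights[i][j] is how strongly
    query i attends to key j and outputs[i] is the weighted sum of values.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        out = [sum(w_j * v[d] for w_j, v in zip(w, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy features (hypothetical values): 3 image regions, 2 question words, dim 4.
image_regions = [[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0, 0.0]]
question_words = [[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]]

# Self-attention: one modality attends to itself.
self_attended_image, _ = attention(image_regions, image_regions, image_regions)

# Guided attention (assumed wiring): image regions query the question words,
# and, in the other direction, question words query the image regions.
guided_image, region_to_word = attention(image_regions, question_words, question_words)
guided_question, word_to_region = attention(question_words, image_regions, image_regions)
```

Here `region_to_word[i][j]` plays the role of the correlation between image region `i` and question word `j`; a trained model would compute queries, keys, and values through learned linear layers rather than use the raw features directly.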

Citation

Pages: 11921 - 11943

Page count: 23

