Deep Modular Bilinear Attention Network for Visual Question Answering

Citations: 0

Authors:

Yan, Feng [1]

Silamu, Wushouer [1,2]

Li, Yanbing [1]

Affiliations:

[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi 830046, Peoples R China

[2] Xinjiang Univ, Lab Multilingual Informat Technol, Urumqi 830046, Peoples R China

Source:

SENSORS | 2022 / Volume 22 / Issue 03

Funding:

National Natural Science Foundation of China;

Keywords:

attention mechanism; visual question answering; multi-model; bilinear attention network;

DOI:

10.3390/s22031045

Chinese Library Classification:

O65 [Analytical Chemistry];

Discipline classification codes:

070302 ; 081704 ;

Abstract:

Visual Question Answering (VQA) is a multimodal task: given an image and a question about it, the model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product attention to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, the Bilinear Attention Network (BAN) method is used to calculate attention instead. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. These two units are the core of the network and can be cascaded in depth. In addition, we encode the question using dynamic word vectors from BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. The resulting question features are summed with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std split of VQA 2.0.
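
To illustrate the kind of bilinear attention the abstract contrasts with dot-product attention, the following is a minimal sketch of a low-rank bilinear attention map between question words and image regions. It is not the DMBA-NET implementation: the function names, the toy dimensions, and the random weights `U`, `V`, and `p` are all assumptions made for illustration, and nonlinearities, multiple glimpses, and the BAN-GA/BAN-SA units themselves are omitted. A single softmax over the whole map is used here as one plausible normalization.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def bilinear_attention_map(questions, regions, U, V, p):
    """Joint attention over all (word, region) pairs.

    logit[i][j] = p . ((U q_i) * (V r_j)), followed by one softmax over
    the whole map, so every entry is positive and the map sums to 1.
    """
    q_proj = [matvec(U, q) for q in questions]   # each has length `rank`
    r_proj = [matvec(V, r) for r in regions]     # each has length `rank`
    logits = [
        [sum(pk * a * b for pk, a, b in zip(p, qp, rp)) for rp in r_proj]
        for qp in q_proj
    ]
    peak = max(max(row) for row in logits)       # subtract for stability
    exps = [[math.exp(x - peak) for x in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def random_matrix(rng, rows, cols):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_words, n_regions = 3, 4          # toy sizes, not the paper's
d_q, d_v, rank = 5, 6, 2           # hypothetical feature and rank sizes

questions = random_matrix(rng, n_words, d_q)
regions = random_matrix(rng, n_regions, d_v)
U = random_matrix(rng, rank, d_q)
V = random_matrix(rng, rank, d_v)
p = [rng.uniform(-1.0, 1.0) for _ in range(rank)]

attention = bilinear_attention_map(questions, regions, U, V, p)
```

Unlike a dot-product score, which requires both modalities to share one feature dimension, the bilinear form projects each modality with its own matrix, so question and region features of different sizes can interact directly.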


Pages: 15

