Deep Modular Bilinear Attention Network for Visual Question Answering

Cited by: 0

Authors:

Yan, Feng [1]

Silamu, Wushouer [1,2]

Li, Yanbing [1]

Affiliations:

[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi 830046, Peoples R China

[2] Xinjiang Univ, Lab Multilingual Informat Technol, Urumqi 830046, Peoples R China

Source:

SENSORS | 2022 / Vol. 22 / Issue 03

Funding:

National Natural Science Foundation of China;

Keywords:

attention mechanism; visual question answering; multi-model; bilinear attention network;

DOI:

10.3390/s22031045

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry];

Discipline classification code:

070302 ; 081704 ;

Abstract:

Visual question answering (VQA) is a multimodal task: given an image and a question about it, the model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product attention to compute the intra-modality and inter-modality attention between visual and language features. In this paper, we instead use the bilinear attention network (BAN) method to compute attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. These two units are the core of the framework and can be cascaded in depth. In addition, we encode the question with dynamic word vectors from BERT (Bidirectional Encoder Representations from Transformers) and then apply self-attention to further process the question features. We sum these features with those obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std split of VQA 2.0.
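
The contrast the abstract draws between dot-product attention and bilinear attention can be illustrated with a small sketch. It is not the authors' DMBA-NET code: the function names, the toy feature sizes, the `tanh` activation, the random projections `U`, `V`, `p`, and the softmax taken jointly over all region-word pairs are assumptions chosen only to show the general shape of a low-rank bilinear attention map in the spirit of BAN.

```python
import math
import random

def matvec_t(W, x):
    """Return W^T x, where W is a list of len(x) rows with k columns each."""
    k = len(W[0])
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(k)]

def dot_product_logits(X, Y):
    """Scaled dot-product scores: logits[i][j] = <x_i, y_j> / sqrt(d).
    Both inputs must share the same feature dimension d."""
    d = len(X[0])
    return [[sum(a * b for a, b in zip(x, y)) / math.sqrt(d) for y in Y] for x in X]

def bilinear_logits(X, Y, U, V, p):
    """Low-rank bilinear scores (assumed form):
    logits[i][j] = p^T (tanh(U^T x_i) * tanh(V^T y_j)).
    X and Y may have different feature dimensions."""
    XU = [[math.tanh(v) for v in matvec_t(U, x)] for x in X]
    YV = [[math.tanh(v) for v in matvec_t(V, y)] for y in Y]
    return [[sum(pk * a * b for pk, a, b in zip(p, xu, yv)) for yv in YV] for xu in XU]

def joint_softmax(logits):
    """Softmax over all (i, j) pairs, so the whole map sums to 1."""
    m = max(max(row) for row in logits)
    exps = [[math.exp(v - m) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def random_matrix(rows, cols, rng):
    """Toy stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_regions, n_words, d_img, d_txt, rank = 4, 3, 6, 5, 2  # hypothetical sizes
X = random_matrix(n_regions, d_img, rng)  # toy image-region features
Y = random_matrix(n_words, d_txt, rng)    # toy question-word features
U = random_matrix(d_img, rank, rng)       # projection for the image side
V = random_matrix(d_txt, rank, rng)       # projection for the question side
p = [rng.uniform(-1.0, 1.0) for _ in range(rank)]

# Intra-modality scores between image regions via dot product.
self_scores = dot_product_logits(X, X)
# Inter-modality attention map between regions and words via the bilinear form.
attention = joint_softmax(bilinear_logits(X, Y, U, V, p))
```

In this sketch the dot-product scores require both inputs to live in the same feature space, whereas the bilinear form projects each modality separately before combining them, which is why it can relate image and question features of different sizes.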


Pages: 15

50 entries in total

[41] Multimodal Local Perception Bilinear Pooling for Visual Question Answering
Lao, Mingrui
Guo, Yanming
Wang, Hui
Zhang, Xin
[J]. IEEE ACCESS, 2018, 6 : 57923 - 57932
[42] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
Yu, Zhou
Yu, Jun
Fan, Jianping
Tao, Dacheng
[J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1839 - 1848
[43] Visual question answering model based on graph neural network and contextual attention
Sharma, Himanshu
Jalal, Anand Singh
[J]. IMAGE AND VISION COMPUTING, 2021, 110
[44] Multi-Channel Co-Attention Network for Visual Question Answering
Tian, Weidong
He, Bin
Wang, Nanxun
Zhao, Zhongqiu
[J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
[45] CRA-Net: Composed Relation Attention Network for Visual Question Answering
Peng, Liang
Yang, Yang
Wang, Zheng
Wu, Xiao
Huang, Zi
[J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1202 - 1210
[46] Compound-Attention Network with Original Feature injection for visual question and answering
Wu, Chunlei
Lu, Jing
Li, Haisheng
Wu, Jie
Duan, Hailong
Yuan, Shaozu
[J]. SIGNAL IMAGE AND VIDEO PROCESSING, 2021, 15 (08) : 1853 - 1861
[47] Efficient Multi-step Reasoning Attention Network for Visual Question Answering
Zhang, Haotian
Wu, Wei
Zhang, Meng
[J]. THIRTEENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING (ICGIP 2021), 2022, 12083
[48] Multi-Modality Global Fusion Attention Network for Visual Question Answering
Yang, Cheng
Wu, Weijia
Wang, Yuxing
Zhou, Hong
[J]. ELECTRONICS, 2020, 9 (11) : 1 - 12
[49] Compound-Attention Network with Original Feature injection for visual question and answering
Wu, Chunlei
Lu, Jing
Li, Haisheng
Wu, Jie
Duan, Hailong
Yuan, Shaozu
[J]. Signal, Image and Video Processing, 2021, 15 : 1853 - 1861
[50] Affective Visual Question Answering Network
Ruwa, Nelson
Mao, Qirong
Wang, Liangjun
Dong, Ming
[J]. IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 170 - 173
