Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering

Cited by: 6
Authors
Guo, Zihan [1 ]
Han, Dezhi [1 ]
Affiliations
[1] Shanghai Maritime Univ, Coll Informat Engn, Shanghai 201306, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
attention mechanism; computer vision; natural language processing; sparse attention; visual question answering;
DOI
10.3390/s20236758
Chinese Library Classification
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Visual question answering (VQA) is a multi-modal task spanning natural language processing (NLP) and computer vision (CV): a model must understand the visual information in an input image and the textual information in an input question simultaneously in order to predict the correct answer. VQA has been widely applied in intelligent transport systems, smart cities, and other fields. Advanced VQA approaches model dense interactions between image regions and question words with co-attention mechanisms to achieve better accuracy. However, modeling interactions between every image region and every question word forces the model to compute irrelevant information, distracting its attention. To solve this problem, we propose a novel model called Multi-modal Explicit Sparse Attention Networks (MESAN), which concentrates the model's attention by explicitly selecting the parts of the input features that are most relevant to answering the input question. This top-k selection reduces the interference caused by irrelevant information and ultimately helps the model achieve better performance. Experimental results on the benchmark dataset VQA v2 demonstrate the effectiveness of our model: our best single model delivers 70.71% and 71.08% overall accuracy on the test-dev and test-std sets, respectively. Attention visualizations further show that our model obtains better attended features than other advanced models. Our work proves that models with sparse attention mechanisms can achieve competitive results on VQA datasets, and we hope it promotes the development of VQA models and the application of VQA-related artificial intelligence (AI) technology in various fields.
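The top-k selection that the abstract describes can be illustrated with a short sketch. Below is a minimal PyTorch rendition of explicit sparse attention over a scaled dot-product base: for each query (e.g., a question word), only the k largest attention scores (e.g., over image regions) are kept, and the rest are masked to negative infinity before the softmax. The function name, tensor shapes, and the default k = 20 are illustrative assumptions, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def sparse_scaled_dot_attention(q, k, v, topk=20):
    # Illustrative sketch, not the authors' implementation.
    # q: (batch, n_q, d); k, v: (batch, n_kv, d).
    d = q.size(-1)
    # Standard scaled dot-product scores, shape (batch, n_q, n_kv).
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
    # Explicit sparsification: find each query row's k-th largest score
    # and mask everything below it to -inf, so those positions receive
    # exactly zero weight after the softmax.
    n_keep = min(topk, scores.size(-1))
    kth = scores.topk(n_keep, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float('-inf'))
    attn = F.softmax(scores, dim=-1)  # nonzero only on the top-k entries
    return torch.matmul(attn, v)

# Example: 14 question-word queries attending over 100 image regions.
q = torch.randn(2, 14, 512)
k = v = torch.randn(2, 100, 512)
out = sparse_scaled_dot_attention(q, k, v, topk=20)  # (2, 14, 512)
```

Because the mask is applied before the softmax, the surviving top-k weights still sum to one per query, so the sparsification redistributes attention to the selected features rather than merely truncating it.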
Pages: 1-15 (15 pages)
Related Papers
50 records in total; 10 shown below
  • [1] Multi-modal spatial relational attention networks for visual question answering
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Sun, Yuxin
    Zhang, Zhi
    Luo, Yongkang
    Image and Vision Computing, 2023, 140
  • [2] Multi-modal co-attention relation networks for visual question answering
    Guo, Zihan
    Han, Dezhi
    The Visual Computer, 2023, 39 (11): 5783-5795
  • [3] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    IEEE Transactions on Neural Networks and Learning Systems, 2021, 32 (09): 3894-3908
  • [4] The multi-modal fusion in visual question answering: a review of attention mechanisms
    Lu, Siyu
    Liu, Mingzhe
    Yin, Lirong
    Yin, Zhengtong
    Liu, Xuan
    Zheng, Wenfeng
    PeerJ Computer Science, 2023, 9
  • [5] Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism
    Xia, Qihao
    Yu, Chao
    Hou, Yinong
    Peng, Pingping
    Zheng, Zhengqi
    Chen, Wen
    Electronics, 2022, 11 (11)
  • [6] Differentiated Attention with Multi-modal Reasoning for Video Question Answering
    Yao, Shentao
    Li, Kun
    Xing, Kun
    Wu, Kewei
    Xie, Zhao
    Guo, Dan
    2022 IEEE International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), 2022: 525-530
  • [7] Multi-modal adaptive gated mechanism for visual question answering
    Xu, Yangshuyi
    Zhang, Lin
    Shen, Xiang
    PLOS ONE, 2023, 18 (06)
  • [8] Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering
    Huang, Hantao
    Han, Tao
    Han, Wei
    Yap, Deep
    Chiang, Cheng-Ming
    2020 25th International Conference on Pattern Recognition (ICPR), 2021: 1173-1180
  • [9] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Fan, Jianping
    Tao, Dacheng
    2017 IEEE International Conference on Computer Vision (ICCV), 2017: 1839-1848