A Multi-level Mesh Mutual Attention Model for Visual Question Answering

Cited by: 0

Authors:

Zhi Lei

Guixian Zhang

Lijuan Wu

Kui Zhang

Rongjiao Liang

Affiliation:

[1] Guangxi Normal University, Guangxi Key Lab of Multi

Source:

Data Science and Engineering | 2022 / Vol. 7

Keywords:

Visual question answering; Multi-level; Mutual attention; Multi-head

DOI:

Not available

Abstract:

Visual question answering is a complex multimodal task involving images and text, with broad application prospects in human–computer interaction and medical assistance. A central challenge is therefore how to model the interaction between critical image regions and question keywords and how to fuse the resulting multimodal features. To this end, we propose a neural network based on the encoder–decoder structure of the transformer architecture. In the encoder, we use multi-head self-attention to mine word–word connections within the question features and stack multiple attention layers to obtain multi-level question features. In the decoder, we propose a mutual attention module that exchanges information between modalities to obtain better representations of the question and image features. In addition, we connect the encoder and decoder in a meshed manner, perform mutual attention with the multi-level question features, and aggregate the resulting information adaptively. In the fusion stage, we propose a multi-scale fusion module that uses feature information at different scales to complete modal fusion. We validate the model's effectiveness on the VQA v1 and VQA v2 datasets, where it achieves better results than state-of-the-art methods.
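
The mutual attention and meshed aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no equations, so the sketch assumes single-head scaled dot-product attention without learned projections, and the softmax gate in `mesh_aggregate` is a hypothetical stand-in for the paper's adaptive aggregation. All function names and toy inputs are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted sum of the value vectors, one output per query.
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def mutual_attention(question, image):
    """Assumed form of mutual attention: each modality attends to the other."""
    question_out = attention(question, image, image)
    image_out = attention(image, question, question)
    return question_out, image_out

def mesh_aggregate(image, question_levels):
    """Attend image regions to every question level, then mix the levels.

    The gate below (softmax over region/output similarity) is an assumption
    made for illustration, not the aggregation rule from the paper.
    """
    per_level = [attention(image, level, level) for level in question_levels]
    mixed = []
    for i, region in enumerate(image):
        gates = softmax([dot(region, out[i]) for out in per_level])
        mixed.append([sum(g * out[i][d] for g, out in zip(gates, per_level))
                      for d in range(len(region))])
    return mixed

# Toy inputs: 2 question tokens (at two encoder levels) and 3 image regions,
# all of dimension 2.
question_level_1 = [[1.0, 0.0], [0.0, 1.0]]
question_level_2 = [[0.5, 0.5], [1.0, 1.0]]
image_regions = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]

q_out, v_out = mutual_attention(question_level_1, image_regions)
fused = mesh_aggregate(image_regions, [question_level_1, question_level_2])
```

Here `q_out` holds one image-aware vector per question token, `v_out` one question-aware vector per image region, and `fused` one vector per region that mixes both question levels. A trained model would add learned query, key, and value projections, multiple heads, and learned gating.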

Citation:

Pages: 339-353

Number of pages: 14

[1] A Multi-level Mesh Mutual Attention Model for Visual Question Answering
Lei, Zhi
Zhang, Guixian
Wu, Lijuan
Zhang, Kui
Liang, Rongjiao
[J]. Data Science and Engineering, 2022, 7(04): 339-353