Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Cited by: 215

Authors:

Duy-Kien Nguyen ^{[1]}

Okatani, Takayuki ^{[1,2]}

Affiliations:

[1] Tohoku Univ, Sendai, Miyagi, Japan

[2] RIKEN, Ctr AIP, Wako, Saitama, Japan

Source:

2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018

DOI:

10.1109/CVPR.2018.00637

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

A key to visual question answering (VQA) lies in how to fuse the visual and language features extracted from an input image and question. We show that an attention mechanism enabling dense, bi-directional interactions between the two modalities helps boost the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions, which leads to correct answer prediction.
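
The bi-directional attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the bilinear affinity with weight `W`, the additive fusion, and every name below (`co_attention_layer`, `Q`, `V`, `attn_q2v`, `attn_v2q`) are assumptions chosen for illustration. It only shows the symmetric idea that each question word attends to image regions and each image region attends to question words, and that such a layer can be stacked.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention_layer(Q, V, W):
    """One symmetric co-attention step (illustrative sketch only).

    Q: (N, d) question-word features, V: (T, d) image-region features,
    W: (d, d) bilinear affinity weight (an assumed parameterization).
    """
    A = Q @ W @ V.T                    # (N, T) word-region affinity
    attn_q2v = softmax(A, axis=1)      # each word's weights over regions
    attn_v2q = softmax(A, axis=0)      # each region's weights over words
    V_for_Q = attn_q2v @ V             # (N, d) visual summary per word
    Q_for_V = attn_v2q.T @ Q           # (T, d) language summary per region
    # Assumption: fuse by simple addition so the feature size stays d
    # and the layer can be applied repeatedly.
    return Q + V_for_Q, V + Q_for_V, attn_q2v, attn_v2q

# Hypothetical sizes: 5 question words, 7 image regions, 8-dim features.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
V = rng.normal(size=(7, 8))
weights = [rng.normal(size=(8, 8)) * 0.1 for _ in range(3)]

# Stacking three layers gives multi-step interaction between the pair.
for W in weights:
    Q, V, attn_q2v, attn_v2q = co_attention_layer(Q, V, W)
print(Q.shape, V.shape)
```

Additive fusion is used here only to keep the shapes fixed across stacked layers; a learned projection over concatenated features would be another reasonable choice under the same assumptions.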

Citation

Pages: 6087-6096

Page count: 10

50 items in total

[1] An Effective Dense Co-Attention Networks for Visual Question Answering
He, Shirong
Han, Dezhi
[J]. SENSORS, 2020, 20 (17) : 1 - 15
[2] IMCN: Improved modular co-attention networks for visual question answering
Liu, Cheng
Wang, Chao
Peng, Yan
[J]. APPLIED INTELLIGENCE, 2024, 54 (06) : 5167 - 5182
[3] Co-Attention Network With Question Type for Visual Question Answering
Yang, Chao
Jiang, Mengqi
Jiang, Bin
Zhou, Weixin
Li, Keqin
[J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
[4] Dynamic Co-attention Network for Visual Question Answering
Ebaid, Doaa B.
Madbouly, Magda M.
El-Zoghabi, Adel A.
[J]. 2021 8TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING & MACHINE INTELLIGENCE (ISCMI 2021), 2021, : 125 - 129
[5] Hierarchical Question-Image Co-Attention for Visual Question Answering
Lu, Jiasen
Yang, Jianwei
Batra, Dhruv
Parikh, Devi
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
[6] Deep Modular Co-Attention Networks for Visual Question Answering
Yu, Zhou
Yu, Jun
Cui, Yuhao
Tao, Dacheng
Tian, Qi
[J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6274 - 6283
[7] Co-attention Network for Visual Question Answering Based on Dual Attention
Dong, Feng
Wang, Xiaofeng
Oad, Ammar
Talpur, Mir Sajjad Hussain
[J]. Journal of Engineering Science and Technology Review, 2021, 14 (06) : 116 - 123
[8] Co-attention graph convolutional network for visual question answering
Liu, Chuan
Tan, Ying-Ying
Xia, Tian-Tian
Zhang, Jiajing
Zhu, Ming
[J]. MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2527 - 2543
[9] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
Lao, Mingrui
Guo, Yanming
Wang, Hui
Zhang, Xin
[J]. IEEE ACCESS, 2018, 6 : 31516 - 31524
