Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Cited by: 215

Authors:

Duy-Kien Nguyen [1]

Okatani, Takayuki [1,2]

Affiliations:

[1] Tohoku Univ, Sendai, Miyagi, Japan

[2] RIKEN, Ctr AIP, Wako, Saitama, Japan

Source:

2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018

Keywords:

DOI:

10.1109/CVPR.2018.00637

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

A key solution to visual question answering (VQA) exists in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities contributes to boost accuracy of prediction of answers. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size. We also present qualitative evaluation, demonstrating how the proposed attention mechanism can generate reasonable attention maps on images and questions, which leads to the correct answer prediction.
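
The bi-directional attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the scaled dot-product affinity, and the residual-addition fusion are assumptions chosen only to keep the example short and stackable. It shows each question word attending over image regions and each image region attending over question words, using one shared affinity matrix.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def symmetric_co_attention(V, Q):
    """One co-attention step (illustrative, hypothetical helper).

    V: (T, d) array of image-region features.
    Q: (N, d) array of question-word features.
    Returns updated V and Q plus both attention maps.
    """
    d = V.shape[1]
    # Shared affinity between every word and every region: (N, T).
    # Scaling by sqrt(d) is an assumption, not taken from the abstract.
    A = Q @ V.T / np.sqrt(d)
    # Each question word attends over image regions (rows sum to 1).
    word_to_region = softmax(A, axis=1)        # (N, T)
    # Each image region attends over question words (rows sum to 1).
    region_to_word = softmax(A, axis=0).T      # (T, N)
    # Attended summaries of the other modality.
    V_for_words = word_to_region @ V           # (N, d)
    Q_for_regions = region_to_word @ Q         # (T, d)
    # Assumed fusion: residual addition, so shapes are preserved
    # and the step can be stacked.
    Q_new = Q + V_for_words
    V_new = V + Q_for_regions
    return V_new, Q_new, word_to_region, region_to_word

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = rng.normal(size=(5, 8))   # 5 image regions, 8-dim features
    Q = rng.normal(size=(3, 8))   # 3 question words, 8-dim features
    # Stack two steps for multi-step interaction.
    for _ in range(2):
        V, Q, w2r, r2w = symmetric_co_attention(V, Q)
    print(V.shape, Q.shape, w2r.shape, r2w.shape)
```

Because the assumed fusion keeps the feature dimension unchanged, the same step can be applied repeatedly, which mirrors the stacking idea mentioned in the abstract.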


Pages: 6087 - 6096

Number of pages: 10

50 items in total

[21] LRCN: Layer-residual Co-Attention Networks for visual question answering
Han, Dezhi
Shi, Jingya
Zhao, Jiahao
Wu, Huafeng
Zhou, Yachao
Li, Ling-Huey
Khan, Muhammad Khurram
Li, Kuan-Ching
[J]. Expert Systems with Applications, 2025, 263
[22] Multimodal feature-wise co-attention method for visual question answering
Zhang, Sheng
Chen, Min
Chen, Jincai
Zou, Fuhao
Li, Yuan-Fang
Lu, Ping
[J]. INFORMATION FUSION, 2021, 73 : 1 - 10
[23] Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism
Sharma, Himanshu
Srivastava, Swati
[J]. IMAGING SCIENCE JOURNAL, 2021, 69 (1-4): 177 - 189
[24] Feature Fusion Attention Visual Question Answering
Wang, Chunlin
Sun, Jianyong
Chen, Xiaolin
[J]. ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019: 412 - 416
[25] Bi-direction Co-Attention Network on Visual Question Answering for Blind People
Tung Le
Thong Bui
Huy Tien Nguyen
Minh Le Nguyen
[J]. FOURTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2021), 2022, 12084
[26] SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering
Cao, Feiqi
Luo, Siwen
Nunez, Felipe
Wen, Zean
Poon, Josiah
Han, Soyeon Caren
[J]. ROBOTICS, 2023, 12 (04)
[27] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
Yu, Zhou
Yu, Jun
Fan, Jianping
Tao, Dacheng
[J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017: 1839 - 1848
[28] ADAPTIVE ATTENTION FUSION NETWORK FOR VISUAL QUESTION ANSWERING
Gu, Geonmo
Kim, Seong Tae
Ro, Yong Man
[J]. 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017: 997 - 1002
[29] Integrating multimodal features by a two-way co-attention mechanism for visual question answering
Sharma, Himanshu
Srivastava, Swati
[J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (21) : 59577 - 59595
[30] SPCA-Net: a based on spatial position relationship co-attention network for visual question answering
Yan, Feng
Silamu, Wushouer
Li, Yanbin
Chai, Yachuang
[J]. VISUAL COMPUTER, 2022, 38 (9-10): 3097 - 3108
