Visual question answering with attention transfer and a cross-modal gating mechanism

Cited: 15
Authors
Li, Wei [1]
Sun, Jianhui [1]
Liu, Ge [1]
Zhao, Linglan [1]
Fang, Xiangzhong [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China
Keywords
Attention; Visual question answering; Gating
DOI
10.1016/j.patrec.2020.02.031
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual question answering (VQA) is challenging since it requires understanding both the language of the question and the corresponding visual content. Many efforts have been made to capture single-step interactions between language and vision. However, answering complex questions requires multiple steps of reasoning that gradually shift the region of interest toward the most relevant part of the given image, a process that has not been well investigated. To integrate question-related object relations into the attention mechanism, we propose a multi-step attention architecture that facilitates the modeling of multi-modal correlations. First, an attention transfer mechanism is integrated to gradually adjust the region of interest according to the reasoning representation of the question. Second, we propose a cross-modal gating strategy that filters out irrelevant information based on multi-modal correlations. Finally, we achieve state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset, which verifies the effectiveness of the proposed method. (C) 2020 Elsevier B.V. All rights reserved.
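The two components sketched in the abstract can be illustrated with a toy NumPy example. This is a minimal sketch under my own assumptions, not the paper's implementation: the cross-modal gate is modeled as a sigmoid over concatenated question and attended-visual features, and attention transfer is modeled by conditioning each step's attention on the previous step's distribution; all names, shapes, and weight initializations are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_regions, n_steps = 8, 5, 2          # toy dimensions (assumed)

q = rng.standard_normal(d)               # question embedding
V = rng.standard_normal((n_regions, d))  # image region features

# Illustrative projection weights (randomly initialized for the sketch)
Wq = rng.standard_normal((d, d)) * 0.1
Wv = rng.standard_normal((d, d)) * 0.1
Wg = rng.standard_normal((2 * d, d)) * 0.1

att = np.full(n_regions, 1.0 / n_regions)  # start from uniform attention
for step in range(n_steps):
    v_att = att @ V                        # attended visual summary
    # Cross-modal gate: sigmoid over fused question + visual features,
    # used to suppress question dimensions irrelevant to the image
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([q, v_att]) @ Wg)))
    q_gated = gate * (q @ Wq)
    # Attention transfer: new region scores conditioned on the
    # previous attention distribution (log-prior term)
    scores = (V @ Wv) @ q_gated
    att = softmax(scores + np.log(att + 1e-8))

print(att.shape)  # per-region attention after multi-step reasoning
```

Each iteration re-weights the regions, so the distribution `att` can drift from its uniform start toward regions whose features align with the gated question representation, which is the intuition behind multi-step "attention transfer" in the abstract.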
Pages: 334-340
Page count: 7
Related Papers
50 items in total
  • [1] Medical visual question answering with symmetric interaction attention and cross-modal gating
    Chen, Zhi; Zou, Beiji; Dai, Yulan; Zhu, Chengzhang; Kong, Guilan; Zhang, Wensheng
    Biomedical Signal Processing and Control, 2023, 85
  • [2] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui; Guo, Yanming; Wang, Hui; Zhang, Xin
    IEEE Access, 2018, 6: 31516-31524
  • [3] Cross-Modal Visual Question Answering for Remote Sensing Data
    Felix, Rafael; Repasky, Boris; Hodge, Samuel; Zolfaghari, Reza; Abbasnejad, Ehsan; Sherrah, Jamie
    2021 International Conference on Digital Image Computing: Techniques and Applications (DICTA 2021), 2021: 57-65
  • [4] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu; Liu, Ruifang; Peng, Bo
    2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2021), 2021: 3939-3948
  • [5] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong; Yang, Qihao; Wang, Fu Lee; Lee, Lap-Kei; Qu, Yingying; Hao, Tianyong
    Artificial Intelligence in Medicine, 2023, 144
  • [6] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul; Ferret, Olivier; Guinaudeau, Camille
    Advances in Information Retrieval, ECIR 2024, Pt I, 2024, 14608: 421-438
  • [7] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
    Yu, Jing; Zhang, Weifeng; Lu, Yuhang; Qin, Zengchang; Hu, Yue; Tan, Jianlong; Wu, Qi
    IEEE Transactions on Multimedia, 2020, 22 (12): 3196-3209
  • [8] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing; Zhu, Zihao; Wang, Yujing; Zhang, Weifeng; Hu, Yue; Tan, Jianlong
    Pattern Recognition, 2020, 108
  • [9] Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering
    Reichman, Benjamin; Heck, Larry
    2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023: 2829-2834
  • [10] Jointly Learning Attentions with Semantic Cross-Modal Correlation for Visual Question Answering
    Cao, Liangfu; Gao, Lianli; Song, Jingkuan; Xu, Xing; Shen, Heng Tao
    Databases Theory and Applications, ADC 2017, 2017, 10538: 248-260