Dual self-attention with co-attention networks for visual question answering

Cited by: 40
Authors
Liu, Yun [1 ,2 ]
Zhang, Xiaoming [3 ]
Zhang, Qianyun [3 ]
Li, Chaozhuo [4 ]
Huang, Feiran [5 ]
Tang, Xianghong [6 ]
Li, Zhoujun [2 ]
Affiliations
[1] Beijing Informat Sci & Technol Univ, Beijing Key Lab Internet Culture & Digital Dissem, Beijing, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing, Peoples R China
[3] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
[5] Jinan Univ, Coll Informat Sci & Technol, Coll Cyber Secur, Guangzhou, Peoples R China
[6] Guizhou Univ, Key Lab Adv Mfg Technol, Minist Educ, Guiyang, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
Self-attention; Visual-textual co-attention; Visual question answering;
DOI
10.1016/j.patcog.2021.107956
CLC Number
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Visual Question Answering (VQA), an important task in understanding vision and language, has attracted wide interest. In previous VQA methods, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are generally used to extract the visual and textual features respectively, and the correlation between the two is then explored to infer the answer. However, CNNs mainly focus on extracting local spatial information, while RNNs concentrate on exploiting the sequential structure and long-range dependencies; it is difficult for them to integrate local features with their global dependencies to learn more effective representations of the image and question. To address this problem, we propose a novel model, Dual Self-Attention with Co-Attention networks (DSACA), for VQA. It models the internal dependencies of both the spatial and the sequential structure using the newly proposed self-attention mechanism. Specifically, DSACA contains three sub-modules. The visual self-attention module selectively aggregates the visual features at each region as a weighted sum of the features at all positions. The textual self-attention module automatically emphasizes interdependent word features by integrating associated features among the words of the sentence. In addition, the visual-textual co-attention module explores the close correlation between the visual and textual features learned from the self-attention modules. The three modules are integrated into an end-to-end framework to infer the answer. Extensive experiments on three widely used VQA datasets confirm the favorable performance of DSACA compared with state-of-the-art methods. © 2021 Elsevier Ltd. All rights reserved.
Pages: 13
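The abstract above describes three attention components: a visual self-attention module over image regions, a textual self-attention module over question words, and a visual-textual co-attention module linking the two. Below is a minimal PyTorch sketch of that general structure, not the authors' DSACA implementation: the module names, the scaled dot-product formulation, the residual connection, and the toy dimensions (36 regions, 14 words, 512-d features, 3000 answer classes) are all illustrative assumptions.

```python
# Minimal sketch of self-attention and co-attention for VQA-style features.
# Illustrative only; not the exact architecture from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Aggregates each position as a weighted sum of the features at all positions."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                      # x: (batch, positions, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v + x                    # residual keeps the original features


class CoAttention(nn.Module):
    """Lets visual and textual features attend to each other via an affinity matrix."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, visual, textual):        # (batch, regions, dim), (batch, words, dim)
        affinity = visual @ textual.transpose(-2, -1) * self.scale
        v2t = F.softmax(affinity, dim=-1) @ textual                    # text-aware region features
        t2v = F.softmax(affinity.transpose(-2, -1), dim=-1) @ visual   # image-aware word features
        return v2t, t2v


# Toy forward pass: 36 image regions and 14 question words with 512-d features.
visual = torch.randn(2, 36, 512)
textual = torch.randn(2, 14, 512)
visual = SelfAttention(512)(visual)            # visual self-attention module
textual = SelfAttention(512)(textual)          # textual self-attention module
fused_v, fused_t = CoAttention(512)(visual, textual)
answer_logits = nn.Linear(512, 3000)(fused_v.mean(1) + fused_t.mean(1))
print(answer_logits.shape)                     # torch.Size([2, 3000])
```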