Cross-modal Relational Reasoning Network for Visual Question Answering

Cited by: 5

Authors:

Chen, Hongyu [1]

Liu, Ruifang [1]

Peng, Bo [2]

Affiliations:

[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China

[2] Tencent, Shenyang, Peoples R China

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021) | 2021

Keywords:

DOI:

10.1109/ICCVW54120.2021.00441

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual Question Answering (VQA) is a challenging task that requires cross-modal understanding of images and questions, with relational reasoning leading to the correct answer. To bridge the semantic gap between these two modalities, previous works focus on word-region alignments over all possible pairs without paying more attention to the corresponding word and object. Treating all pairs equally, without considering relation consistency, hinders model performance. In this paper, to align relation-consistent pairs and improve the interpretability of VQA systems, we propose a Cross-modal Relational Reasoning Network (CRRN) that masks the inconsistent attention map and highlights the full latent alignments of corresponding word-region pairs. Specifically, we present two relational masks for inter-modal and intra-modal highlighting, which infer the more and less important words in sentences or regions in images. By masking the unaligned relations, the learning focus shifts and the attention interrelationship of consistent pairs is enhanced. We then propose two novel losses, L-CMAM and L-SMAM, with explicit supervision to capture the fine-grained interplay between vision and language. We conduct thorough experiments to demonstrate the effectiveness of our approach and achieve competitive performance, reaching 61.74% on the GQA benchmark.
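
The idea of masking unaligned word-region pairs before attention can be illustrated with a small sketch. This is not the CRRN implementation: the abstract does not specify how the relational masks are computed, so the top-k rule used here, the function name `masked_cross_attention`, and the parameter `keep_top_k` are all assumptions chosen only to show the general mechanism of suppressing low-affinity pairs in a cross-modal attention map.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(words, regions, keep_top_k=2):
    """Attend from words (n_words, d) to regions (n_regions, d).

    Hypothetical masking rule (an assumption, not taken from the paper):
    for each word, keep only its keep_top_k highest-scoring regions.
    """
    d = words.shape[1]
    scores = words @ regions.T / np.sqrt(d)            # (n_words, n_regions)
    # Per-word threshold: the k-th largest score in each row.
    kth = np.sort(scores, axis=1)[:, -keep_top_k][:, None]
    mask = scores >= kth                               # True = pair is kept
    masked_scores = np.where(mask, scores, -np.inf)    # drop unaligned pairs
    attn = softmax(masked_scores, axis=1)              # rows sum to 1
    attended = attn @ regions                          # (n_words, d)
    return attn, attended, mask

# Toy inputs: 4 word features and 6 region features of dimension 8.
rng = np.random.default_rng(0)
words = rng.normal(size=(4, 8))
regions = rng.normal(size=(6, 8))
attn, attended, mask = masked_cross_attention(words, regions, keep_top_k=2)
```

In this sketch each word distributes its attention over only the regions that survive the mask, so masked pairs contribute nothing to the attended features. An intra-modal variant would apply the same pattern with words attending to words, or regions to regions.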


Pages: 3939-3948

Number of pages: 10

50 items in total

[31] Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning
Zhang, Xi
Zhang, Feifei
Xu, Changsheng
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2986 - 2997
[32] Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering
Lyu, Chenyang
Li, Wenxi
Ji, Tianbo
Zhou, Liting
Gurrin, Cathal
[J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 427 - 438
[33] Sequential Visual Reasoning for Visual Question Answering
Liu, Jinlai
Wu, Chenfei
Wang, Xiaojie
Dong, Xuan
[J]. PROCEEDINGS OF 2018 5TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2018, : 410 - 415
[34] CroMIC-QA: The Cross-Modal Information Complementation Based Question Answering
Qian, Shun
Liu, Bingquan
Sun, Chengjie
Xu, Zhen
Ma, Lin
Wang, Baoxun
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8348 - 8359
[35] Chain of Reasoning for Visual Question Answering
Wu, Chenfei
Liu, Jinlai
Wang, Xiaojie
Dong, Xuan
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
[36] Robust visual question answering via semantic cross modal augmentation
Mashrur, Akib
Luo, Wei
Zaidi, Nayyar A.
Robles-Kelly, Antonio
[J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 238
[37] A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering
Zhang, Zixiao
Jiao, Licheng
Li, Lingling
Liu, Xu
Chen, Puhua
Liu, Fang
Li, Yuxuan
Guo, Zhicheng
[J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
[38] Cascade Reasoning Network for Text-based Visual Question Answering
Liu, Fen
Xu, Guanghui
Wu, Qi
Du, Qing
Jia, Wei
Tan, Mingkui
[J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4060 - 4069
[39] DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering
Wang, Yaxian
Wei, Bifan
Liu, Jun
Zhang, Lingling
Wang, Jiaxin
Wang, Qianying
[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 4812 - 4827
[40] Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering
Gong, Haifan
Chen, Guanqi
Liu, Sishuo
Yu, Yizhou
Li, Guanbin
[J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 456 - 460
