Cross-modal Relational Reasoning Network for Visual Question Answering

被引：5

作者：

Chen, Hongyu ^{[1
]}

Liu, Ruifang ^{[1
]}

Peng, Bo ^{[2
]}

机构：

[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China

[2] Tencent, Shenyang, Peoples R China

来源：

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021) | 2021年

关键词：

D O I：

10.1109/ICCVW54120.2021.00441

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Visual Question Answering (VQA) is a challenging task that requires a cross-modal understanding of images and questions with relational reasoning leading to the correct answer. To bridge the semantic gap between these two modalities, previous works focus on the word-region alignments of all possible pairs without attending more attention to the corresponding word and object. Treating all pairs equally without consideration of relation consistency hinders the model's performance. In this paper, to align the relation-consistent pairs and integrate the interpretability of VQA systems, we propose a Cross-modal Relational Reasoning Network (CRRN), to mask the inconsistent attention map and highlight the full latent alignments of corresponding word-region pairs. Specifically, we present two relational masks for inter-modal and intra-modal highlighting, inferring the more and less important words in sentences or regions in images. The attention interrelationship of consistent pairs can be enhanced with the shift of learning focus by masking the unaligned relations. Then, we propose two novel losses L-CMAM and L-SMAM with explicit supervision to capture the fine-grained interplay between vision and language. We have conduct thorough experiments to prove the effectiveness and achieve the competitive performance for reaching 61.74% on GQA benchmark.

引用

页码：3939 / 3948

页数：10

共 50 条

[1] Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
Liu, Yang
Li, Guanbin
Lin, Liang
[J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (10) : 11624 - 11641
[2] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
Yu, Jing
Zhang, Weifeng
Lu, Yuhang
Qin, Zengchang
Hu, Yue
Tan, Jianlong
Wu, Qi
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (12) : 3196 - 3209
[3] Cross-modal knowledge reasoning for knowledge-based visual question answering
Yu, Jing
Zhu, Zihao
Wang, Yujing
Zhang, Weifeng
Hu, Yue
Tan, Jianlong
[J]. PATTERN RECOGNITION, 2020, 108
[4] Cross-Modal Visual Question Answering for Remote Sensing Data
Felix, Rafael
Repasky, Boris
Hodge, Samuel
Zolfaghari, Reza
Abbasnejad, Ehsan
Sherrah, Jamie
[J]. 2021 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA 2021), 2021, : 57 - 65
[5] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
Lao, Mingrui
Guo, Yanming
Wang, Hui
Zhang, Xin
[J]. IEEE ACCESS, 2018, 6 : 31516 - 31524
[6] VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
Liu, Yang
Tan, Ying
Luo, Jingzhou
Chen, Weixing
[J]. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VII, 2024, 14431 : 309 - 322
[7] Visual question answering with attention transfer and a cross-modal gating mechanism
Li, Wei
Sun, Jianhui
Liu, Ge
Zhao, Linglan
Fang, Xiangzhong
[J]. PATTERN RECOGNITION LETTERS, 2020, 133 : 334 - 340
[8] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
Lerner, Paul
Ferret, Olivier
Guinaudeau, Camille
[J]. ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438
[9] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
Li, Yong
Yang, Qihao
Wang, Fu Lee
Lee, Lap-Kei
Qu, Yingying
Hao, Tianyong
[J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
[10] Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering
Zhu, Zihao
Yu, Jing
Wang, Yujing
Sun, Yajing
Hu, Yue
Wu, Qi
[J]. PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1097 - 1103

← 1 2 3 4 5 →