Cross-modal Relational Reasoning Network for Visual Question Answering

Cited by: 5
Authors
Chen, Hongyu [1]
Liu, Ruifang [1]
Peng, Bo [2]
Affiliations
[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[2] Tencent, Shenyang, Peoples R China
DOI
10.1109/ICCVW54120.2021.00441
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) is a challenging task that requires cross-modal understanding of images and questions, with relational reasoning leading to the correct answer. To bridge the semantic gap between the two modalities, previous works focus on word-region alignments over all possible pairs without paying extra attention to the corresponding word and object. Treating all pairs equally, without considering relation consistency, hinders model performance. In this paper, to align relation-consistent pairs and integrate interpretability into VQA systems, we propose a Cross-modal Relational Reasoning Network (CRRN) that masks the inconsistent attention map and highlights the full latent alignments of corresponding word-region pairs. Specifically, we present two relational masks, for inter-modal and intra-modal highlighting, which infer the more and less important words in sentences or regions in images. By masking the unaligned relations, the learning focus shifts and the attention interrelationship of consistent pairs is enhanced. We then propose two novel losses, L-CMAM and L-SMAM, with explicit supervision to capture the fine-grained interplay between vision and language. We have conducted thorough experiments to demonstrate the effectiveness of the method, which achieves competitive performance, reaching 61.74% on the GQA benchmark.
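The idea of masking an attention map so that only selected word-region pairs contribute can be illustrated with a minimal sketch. This is not the CRRN implementation: the abstract does not say how the relational masks are computed, so the threshold rule in `threshold_mask`, the function names, and the toy scores below are all assumptions made for illustration.

```python
import math

def masked_attention(scores, keep_mask):
    """Row-wise softmax over word-region scores, ignoring masked-out pairs.

    scores[i][j]: similarity between word i and region j.
    keep_mask[i][j]: True if the pair is treated as relation-consistent.
    """
    attention = []
    for row, keep in zip(scores, keep_mask):
        kept = [s for s, k in zip(row, keep) if k]
        if not kept:
            # No consistent region for this word: give it zero attention.
            attention.append([0.0] * len(row))
            continue
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        attention.append([e / total for e in exps])
    return attention

def threshold_mask(scores, tau):
    """Hypothetical mask rule (an assumption): keep pairs scoring at least tau."""
    return [[s >= tau for s in row] for row in scores]

# Toy example: 2 words x 3 regions, with made-up similarity scores.
scores = [[2.0, 0.1, -1.0],
          [0.3, 1.5, 1.4]]
mask = threshold_mask(scores, tau=1.0)
attn = masked_attention(scores, mask)
```

In this toy case the first word attends only to the first region, while the second word splits its attention between the two regions that pass the threshold. Masked pairs receive exactly zero weight, which is what shifts the focus toward the retained pairs.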
Pages: 3939-3948
Number of pages: 10