Jointly Learning Attentions with Semantic Cross-Modal Correlation for Visual Question Answering

Cited by: 2
Authors
Cao, Liangfu [1 ]
Gao, Lianli [1 ]
Song, Jingkuan [1 ]
Xu, Xing [1 ]
Shen, Heng Tao [1 ]
Institutions
[1] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1007/978-3-319-68155-9_19
CLC number
TP [Automation Technology, Computer Technology];
Subject classification
0812;
Abstract
Visual Question Answering (VQA) has emerged as a prominent multi-discipline research problem in artificial intelligence. A number of recent studies focus on attention mechanisms such as visual attention ("where to look") or question attention ("what words to listen to"), and these mechanisms have proved effective for VQA. However, they model only the prediction error and ignore the semantic correlation between the image attention and the question attention, which inevitably leads to suboptimal attentions. In this paper, we argue that, in addition to modeling visual and question attentions, it is equally important to model their semantic correlation, so that the two attentions are learned jointly and their joint representation learning for VQA is facilitated. To this end, we propose a novel end-to-end model that jointly learns attentions with semantic cross-modal correlation to solve the VQA problem effectively. Specifically, we propose a multi-modal embedding that maps the visual and question attentions into a joint space to guarantee their semantic consistency. Experimental results on benchmark datasets demonstrate that our model outperforms several state-of-the-art VQA techniques.
Pages: 248-260
Page count: 13
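The abstract above describes the approach only at a high level. Below is a minimal PyTorch sketch of that idea, for illustration: it computes a question-guided visual attention over image regions and a question attention over words, maps both attended features into a joint embedding space, and adds a semantic-consistency (correlation) term to the answer-classification loss. All module names, dimensions, the cosine-distance correlation term, and the loss weight are assumptions for the sketch, not the paper's actual architecture.

# Minimal sketch of jointly learned attentions with a cross-modal
# correlation loss, assuming precomputed region and word features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=512, joint_dim=256, n_answers=1000):
        super().__init__()
        self.img_att = nn.Linear(img_dim + q_dim, 1)   # scores each image region
        self.q_att = nn.Linear(q_dim, 1)               # scores each question word
        self.img_proj = nn.Linear(img_dim, joint_dim)  # image -> joint space
        self.q_proj = nn.Linear(q_dim, joint_dim)      # question -> joint space
        self.classifier = nn.Linear(2 * joint_dim, n_answers)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, R, img_dim) region features; q_feats: (B, T, q_dim) word features
        q_global = q_feats.mean(dim=1)                               # (B, q_dim)
        # Visual attention conditioned on the question ("where to look").
        fused = torch.cat(
            [img_feats, q_global.unsqueeze(1).expand(-1, img_feats.size(1), -1)], dim=-1)
        a_v = F.softmax(self.img_att(fused).squeeze(-1), dim=1)      # (B, R)
        v = (a_v.unsqueeze(-1) * img_feats).sum(dim=1)               # attended image feature
        # Question attention ("what words to listen to").
        a_q = F.softmax(self.q_att(q_feats).squeeze(-1), dim=1)      # (B, T)
        q = (a_q.unsqueeze(-1) * q_feats).sum(dim=1)                 # attended question feature
        # Multi-modal embedding into a joint space.
        v_j = F.normalize(self.img_proj(v), dim=-1)
        q_j = F.normalize(self.q_proj(q), dim=-1)
        # Correlation loss: attended features of the two modalities should agree.
        corr_loss = (1.0 - (v_j * q_j).sum(dim=-1)).mean()
        logits = self.classifier(torch.cat([v_j, q_j], dim=-1))
        return logits, corr_loss

# End-to-end training objective: answer cross-entropy plus the correlation term.
model = JointAttentionVQA()
img = torch.randn(4, 36, 2048)     # e.g. 36 region features per image
ques = torch.randn(4, 14, 512)     # e.g. 14 encoded question words
answers = torch.randint(0, 1000, (4,))
logits, corr = model(img, ques)
loss = F.cross_entropy(logits, answers) + 0.5 * corr  # 0.5 is an assumed weight
loss.backward()

Because both losses flow through the shared attention modules, the gradient of the correlation term pushes the two attentions toward semantically consistent representations rather than letting each minimize the prediction error independently, which is the joint-learning behavior the abstract argues for.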