Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering

Cited by: 1
Authors
Fu, Ze [1 ,2 ]
Zheng, Changmeng [1 ,2 ]
Cai, Yi [1 ,2 ]
Li, Qing [3 ]
Wang, Tao [4 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] MOE China, Key Lab Big Data & Intelligent Robot SCUT, Guangzhou, Peoples R China
[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[4] Kings Coll London, Inst Psychiat Psychol & Neurosci, Dept Biostat & Hlth Informat, London, England
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; Domain adaptation; Modality-invariant co-learning;
DOI
10.1007/978-3-030-85896-4_25
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) is a typical multimodal task with significant development prospects in web applications. To answer a question based on the corresponding image, a VQA model needs to use information from different modalities efficiently. Although multimodal fusion methods such as the attention mechanism have made significant contributions to VQA, these methods attempt to co-learn multimodal features directly, ignoring the large gap between modalities and therefore aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning, aiming to learn modality-invariant features for better semantic alignment and higher answer-prediction accuracy. Our model achieves an accuracy of 70.81% on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
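The record gives no implementation details beyond the abstract, but the core idea of cross-modality adversarial learning, training features so a discriminator cannot tell which modality they came from, is commonly realized with a gradient reversal layer. Below is a minimal PyTorch sketch of that generic setup; the module names (e.g. `ModalityDiscriminator`), dimensions, and the weighting `lambd` are illustrative assumptions, not the authors' CMAN implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient pushes the upstream encoders to *fool* the discriminator.
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Classifies whether a feature came from the visual or the textual encoder."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(),
            nn.Linear(dim // 2, 2),  # two classes: image vs. question
        )

    def forward(self, feat, lambd=1.0):
        return self.net(GradReverse.apply(feat, lambd))

# Illustrative training step: both encoders project into a shared space; the
# discriminator learns to tell the modalities apart while the reversed gradient
# drives the encoders toward modality-invariant features.
disc = ModalityDiscriminator(dim=512)
criterion = nn.CrossEntropyLoss()

img_feat = torch.randn(32, 512)  # stand-in for encoded image-region features
txt_feat = torch.randn(32, 512)  # stand-in for encoded question features

feats = torch.cat([img_feat, txt_feat], dim=0)
labels = torch.cat([torch.zeros(32), torch.ones(32)]).long()  # 0 = image, 1 = text
adv_loss = criterion(disc(feats, lambd=0.5), labels)
# total_loss = answer_loss + adv_loss  # the joint objective would be optimized end to end
```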
Pages: 316-331
Page count: 16
Related Papers
50 records in total
  • [31] Self-attention Cross-modality Fusion Network for Cross-modality Person Re-identification
    Du, Peng
    Song, Yong-Hong
    Zhang, Xin-Yao
    [J]. Zidonghua Xuebao/Acta Automatica Sinica, 2022, 48 (06): 1457-1468
  • [32] Multi-Modality Global Fusion Attention Network for Visual Question Answering
    Yang, Cheng
    Wu, Weijia
    Wang, Yuxing
    Zhou, Hong
    [J]. ELECTRONICS, 2020, 9 (11): 1-12
  • [33] Cross-Modality Pose-Invariant Facial Expression
    Hashemi, Jordan
    Qiu, Qiang
    Sapiro, Guillermo
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2015: 4007-4011
  • [34] Adversarial Decoupling and Modality-Invariant Representation Learning for Visible-Infrared Person Re-Identification
    Hu, Weipeng
    Liu, Bohong
    Zeng, Haitang
    Hou, Yanke
    Hu, Haifeng
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (08): 5095-5109
  • [35] MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
    Hu, Yuchen
    Chen, Chen
    Li, Ruizhe
    Zou, Heqing
    Chng, Eng Siong
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023: 11610-11625
  • [36] Learning enhancing modality-invariant features for visible-infrared person re-identification
    Zhang, La
    Zhao, Xu
    Du, Haohua
    Sun, Jian
    Wang, Jinqiao
    [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024
  • [37] Cross-modality collaborative learning identified pedestrian
    Wen, Xiongjun
    Feng, Xin
    Li, Ping
    Chen, Wenfang
    [J]. VISUAL COMPUTER, 2023, 39 (09): 4117-4132
  • [38] Learning Cross-modality Similarity for Multinomial Data
    Jia, Yangqing
    Salzmann, Mathieu
    Darrell, Trevor
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2011: 2407-2414
  • [39] Learning Diffeomorphic and Modality-invariant Registration using B-splines
    Qiu, Huaqi
    Qin, Chen
    Schuh, Andreas
    Hammernik, Kerstin
    Rueckert, Daniel
    [J]. MEDICAL IMAGING WITH DEEP LEARNING, VOL 143, 2021, 143: 645-663
  • [40] Cross-Modality Distillation: A Case for Conditional Generative Adversarial Networks
    Roheda, Siddharth
    Riggan, Benjamin S.
    Krim, Hamid
    Dai, Liyi
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 2926-2930