Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering

Times cited: 1
Authors
Fu, Ze [1 ,2 ]
Zheng, Changmeng [1 ,2 ]
Cai, Yi [1 ,2 ]
Li, Qing [3 ]
Wang, Tao [4 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] MOE China, Key Lab Big Data & Intelligent Robot SCUT, Shanghai, Peoples R China
[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[4] Kings Coll London, Inst Psychiat Psychol & Neurosci, Dept Biostat & Hlth Informat, London, England
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Domain adaptation; Modality-invariant co-learning;
DOI
10.1007/978-3-030-85896-4_25
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) is a typical multimodal task with significant development prospects in web applications. To answer a question about a corresponding image, a VQA model must use information from different modalities efficiently. Although multimodal fusion methods such as attention mechanisms have made significant contributions to VQA, these methods try to co-learn the multimodal features directly, ignoring the large gap between modalities and thus aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning, aiming to learn modality-invariant features for better semantic alignment and higher answer-prediction accuracy. The model achieves an accuracy of 70.81% on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
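Cross-modality adversarial learning of the kind the abstract describes is commonly implemented with a modality discriminator trained through a gradient-reversal layer: the discriminator tries to tell visual features from textual ones, while reversed gradients push the encoders toward modality-invariant representations. The sketch below is illustrative only, assuming PyTorch; the names `GradReverse` and `ModalityDiscriminator` and all hyperparameters are hypothetical and not taken from the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows back into the feature encoders.
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Classifies whether a feature came from the image or the question encoder."""
    def __init__(self, dim, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2)
        )

    def forward(self, feat):
        return self.net(GradReverse.apply(feat, self.lambd))

# Toy batch: 4 "visual" and 4 "textual" features of dimension 16.
torch.manual_seed(0)
feats = torch.randn(8, 16, requires_grad=True)
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = image, 1 = question

disc = ModalityDiscriminator(16)
loss = nn.functional.cross_entropy(disc(feats), labels)
loss.backward()
# Because of the reversal, the encoder side receives gradients that
# *increase* discriminator loss, driving features toward modality invariance.
```

In a full VQA pipeline this adversarial loss would be added to the answer-prediction loss, so the shared features stay discriminative for answering while becoming indistinguishable across modalities.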
Pages: 316-331
Page count: 16