Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering

Cited by: 1
Authors
Fu, Ze [1 ,2 ]
Zheng, Changmeng [1 ,2 ]
Cai, Yi [1 ,2 ]
Li, Qing [3 ]
Wang, Tao [4 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[2] MOE China, Key Lab Big Data & Intelligent Robot SCUT, Guangzhou, Peoples R China
[3] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[4] Kings Coll London, Inst Psychiat Psychol & Neurosci, Dept Biostat & Hlth Informat, London, England
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; Domain adaptation; Modality-invariant co-learning;
DOI
10.1007/978-3-030-85896-4_25
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) is a typical multimodal task with significant development prospects in web applications. To answer a question based on the corresponding image, a VQA model needs to use information from different modalities efficiently. Although multimodal fusion methods such as the attention mechanism have made significant contributions to VQA, these methods attempt to co-learn multimodal features directly, ignoring the large gap between modalities and therefore aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning, aiming to learn modality-invariant features for better semantic alignment and higher answer-prediction accuracy. Our model achieves an accuracy of 70.81% on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
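The record gives no implementation details beyond the abstract, but the core idea of cross-modality adversarial learning, training features so a discriminator cannot tell which modality they came from, is commonly realized with a gradient reversal layer. Below is a minimal PyTorch sketch of that generic setup; the module names (e.g. `ModalityDiscriminator`), dimensions, and the weighting `lambd` are illustrative assumptions, not the authors' CMAN implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient pushes the upstream encoders to *fool* the discriminator.
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Classifies whether a feature came from the visual or the textual encoder."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(),
            nn.Linear(dim // 2, 2),  # two classes: image vs. question
        )

    def forward(self, feat, lambd=1.0):
        return self.net(GradReverse.apply(feat, lambd))

# Illustrative training step: both encoders project into a shared space; the
# discriminator learns to tell the modalities apart while the reversed gradient
# drives the encoders toward modality-invariant features.
disc = ModalityDiscriminator(dim=512)
criterion = nn.CrossEntropyLoss()

img_feat = torch.randn(32, 512)  # stand-in for encoded image-region features
txt_feat = torch.randn(32, 512)  # stand-in for encoded question features

feats = torch.cat([img_feat, txt_feat], dim=0)
labels = torch.cat([torch.zeros(32), torch.ones(32)]).long()  # 0 = image, 1 = text
adv_loss = criterion(disc(feats, lambd=0.5), labels)
# total_loss = answer_loss + adv_loss  # the joint objective would be optimized end to end
```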
Pages: 316-331
Page count: 16
Related Papers
50 records in total
  • [31] Self-attention Cross-modality Fusion Network for Cross-modality Person Re-identification
    Du, Peng
    Song, Yong-Hong
    Zhang, Xin-Yao
    [J]. Zidonghua Xuebao/Acta Automatica Sinica, 2022, 48 (06): 1457-1468
  • [32] Multi-Modality Global Fusion Attention Network for Visual Question Answering
    Yang, Cheng
    Wu, Weijia
    Wang, Yuxing
    Zhou, Hong
    [J]. ELECTRONICS, 2020, 9 (11): 1-12
  • [33] Cross-Modality Pose-Invariant Facial Expression
    Hashemi, Jordan
    Qiu, Qiang
    Sapiro, Guillermo
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2015: 4007-4011
  • [34] Adversarial Decoupling and Modality-Invariant Representation Learning for Visible-Infrared Person Re-Identification
    Hu, Weipeng
    Liu, Bohong
    Zeng, Haitang
    Hou, Yanke
    Hu, Haifeng
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (08): 5095-5109
  • [35] MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
    Hu, Yuchen
    Chen, Chen
    Li, Ruizhe
    Zou, Heqing
    Chng, Eng Siong
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023: 11610-11625
  • [36] Learning enhancing modality-invariant features for visible-infrared person re-identification
    Zhang, La
    Zhao, Xu
    Du, Haohua
    Sun, Jian
    Wang, Jinqiao
    [J]. INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2024
  • [37] Cross-modality collaborative learning identified pedestrian
    Wen, Xiongjun
    Feng, Xin
    Li, Ping
    Chen, Wenfang
    [J]. VISUAL COMPUTER, 2023, 39 (09): 4117-4132
  • [38] Learning Cross-modality Similarity for Multinomial Data
    Jia, Yangqing
    Salzmann, Mathieu
    Darrell, Trevor
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2011: 2407-2414
  • [39] Learning Diffeomorphic and Modality-invariant Registration using B-splines
    Qiu, Huaqi
    Qin, Chen
    Schuh, Andreas
    Hammernik, Kerstin
    Rueckert, Daniel
    [J]. MEDICAL IMAGING WITH DEEP LEARNING, VOL 143, 2021, 143: 645-663
  • [40] Cross-Modality Distillation: A Case for Conditional Generative Adversarial Networks
    Roheda, Siddharth
    Riggan, Benjamin S.
    Krim, Hamid
    Dai, Liyi
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 2926-2930