Enhanced Multimodal Representation Learning with Cross-modal KD

Cited by: 3

Authors:

Chen, Mengxi [1]

Xing, Linyu [1]

Wang, Yu [1,2]

Zhang, Ya [1,2]

Affiliations:

[1] Shanghai Jiao Tong University, Shanghai, People's Republic of China

[2] Shanghai AI Lab, Shanghai, People's Republic of China

Source:

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023

Funding:

National Key Research and Development Program of China;

Keywords:

NETWORKS;

DOI:

10.1109/CVPR52729.2023.01132

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper explores the task of leveraging auxiliary modalities that are available only at training time to enhance multimodal representation learning through cross-modal Knowledge Distillation (KD). The widely adopted objective based on mutual information maximization admits a shortcut solution, the weak teacher: the maximum mutual information is achieved by simply making the teacher model as weak as the student model. To prevent such a weak solution, we introduce an additional objective term, namely the mutual information between the teacher and the auxiliary modality model. In addition, to narrow the information gap between the student and the teacher, we further propose to minimize the conditional entropy of the teacher given the student. Novel training schemes based on contrastive learning and adversarial learning are designed to optimize the mutual information and the conditional entropy, respectively. Experimental results on three popular multimodal benchmark datasets show that the proposed method outperforms a range of state-of-the-art approaches for video recognition, video retrieval and emotion classification.
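
The abstract states that the mutual information terms are optimized with contrastive learning but does not give the loss itself. As an illustration only, the sketch below uses InfoNCE, a common contrastive estimator that is assumed here rather than taken from the paper. The function names `info_nce` and `mi_lower_bound`, the cosine-similarity critic, and the `temperature` parameter are all hypothetical choices. Row i of the teacher batch and row i of the student batch are treated as the positive pair, and every other row serves as a negative.

```python
import math


def info_nce(teacher, student, temperature=0.1):
    """InfoNCE loss for a batch of paired embeddings (assumed estimator).

    `teacher` and `student` are lists of equal-length float lists.
    Pair (teacher[i], student[i]) is positive; other rows are negatives.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    t = [normalize(v) for v in teacher]
    s = [normalize(v) for v in student]
    n = len(t)
    loss = 0.0
    for i in range(n):
        # Scaled cosine similarity of teacher i against every student row.
        logits = [sum(a * b for a, b in zip(t[i], s[j])) / temperature
                  for j in range(n)]
        # Subtract the max for a numerically stable log-sum-exp.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / n


def mi_lower_bound(teacher, student, temperature=0.1):
    """Batch estimate of the InfoNCE bound: log(N) minus the loss."""
    return math.log(len(teacher)) - info_nce(teacher, student, temperature)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    shuffled = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]
    print(info_nce(aligned, aligned))    # matched pairs: small loss
    print(info_nce(aligned, shuffled))   # mismatched pairs: larger loss
```

Minimizing this loss raises the estimated bound on the mutual information between the two sets of embeddings. Under the same assumption, the additional term described in the abstract would apply the same function to the teacher and auxiliary-modality embeddings; the adversarial scheme for the conditional entropy is not sketched here.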


Pages: 11766-11775

Page count: 10

50 items in total

[1] Multimodal Reaction: Information Modulation for Cross-Modal Representation Learning
Zeng, Ying
Mai, Sijie
Yan, Wenjun
Hu, Haifeng
IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2178 - 2191
[2] Cross-Modal Discrete Representation Learning
Liu, Alexander H.
Jin, SouYoung
Lai, Cheng-I Jeff
Rouditchenko, Andrew
Oliva, Aude
Glass, James
PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022 : 3013 - 3035
[3] Multimodal Graph Learning for Cross-Modal Retrieval
Xie, Jingyou
Zhao, Zishuo
Lin, Zhenzhou
Shen, Ying
PROCEEDINGS OF THE 2023 SIAM INTERNATIONAL CONFERENCE ON DATA MINING, SDM, 2023 : 145 - 153
[4] Quaternion Representation Learning for cross-modal matching
Wang, Zheng
Xu, Xing
Wei, Jiwei
Xie, Ning
Shao, Jie
Yang, Yang
KNOWLEDGE-BASED SYSTEMS, 2023, 270
[5] Hybrid representation learning for cross-modal retrieval
Cao, Wenming
Lin, Qiubin
He, Zhihai
He, Zhiquan
NEUROCOMPUTING, 2019, 345 : 45 - 57
[6] Cross-modal contrastive learning for multimodal sentiment recognition
Yang, Shanliang
Cui, Lichao
Wang, Lei
Wang, Tao
APPLIED INTELLIGENCE, 2024, 54 (05) : 4260 - 4276
[7] Deep Multimodal Transfer Learning for Cross-Modal Retrieval
Zhen, Liangli
Hu, Peng
Peng, Xi
Goh, Rick Siow Mong
Zhou, Joey Tianyi
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (02) : 798 - 810
[8] Scalable Deep Multimodal Learning for Cross-Modal Retrieval
Hu, Peng
Zhen, Liangli
Peng, Dezhong
Liu, Pei
PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019 : 635 - 644
[10] Disentangled Representation Learning for Cross-Modal Biometric Matching
Ning, Hailong
Zheng, Xiangtao
Lu, Xiaoqiang
Yuan, Yuan
IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1763 - 1774
