Connecting Multi-modal Contrastive Representations

Cited by: 0
Authors
Wang, Zehan [1]
Zhao, Yang [2]
Cheng, Xize [1]
Huang, Haifeng [1]
Liu, Jiageng [1]
Tang, Li [1]
Li, Linjun [1]
Wang, Yongqi [1]
Yin, Aoxiong [1]
Zhang, Ziang [1]
Zhao, Zhou [1,3]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] ByteDance, Beijing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities. This paper proposes Connecting Multi-modal Contrastive Representations (C-MCR), a novel training-efficient method for learning MCR without paired data. Specifically, given two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project them into a new space and use data from the overlapping modality B to align the two MCRs in that space. Meanwhile, since the modality pairs (A, B) and (B, C) are already aligned within each MCR, the connection learned through the overlapping modality can also be transferred to the non-overlapping modality pair (A, C). To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completeness of embeddings across different modalities for more robust alignment. Then we utilize the inter-MCR alignment to establish the connection, and employ the intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. To demonstrate the effectiveness of C-MCR, we take the fields of audio-visual and 3D-language learning as examples. Specifically, we connect CLIP and CLAP via texts to derive audio-visual representations, and integrate CLIP and ULIP via images for 3D-language representations. Remarkably, without using any paired data, C-MCR for audio-visual learning achieves state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks. Furthermore, C-MCR for 3D-language learning attains advanced zero-shot 3D point cloud classification accuracy on ModelNet40. Our project page is available at https://c-mcr.github.io/C-MCR/
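To make the connection mechanism in the abstract concrete, below is a minimal PyTorch sketch of the core idea: two small projectors map CLIP-text and CLAP-text embeddings of the same captions into a new shared space, where a symmetric InfoNCE loss aligns them. The Projector architecture, embedding dimensions, temperature, and the random placeholder tensors are illustrative assumptions, not the authors' released implementation (see the project page for that); the paper's semantic-enhancement and intra-MCR alignment terms are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Small MLP mapping one MCR's embeddings into the new shared space."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so dot products act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss pulling matched rows of a and b together."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Hypothetical pre-extracted embeddings of the SAME N captions from both text
# encoders (random tensors stand in for real CLIP/CLAP features here).
N = 32
clip_txt = torch.randn(N, 512)  # CLIP text embeddings (512-d in ViT-B variants)
clap_txt = torch.randn(N, 512)  # CLAP text embeddings (dim assumed for illustration)

proj_clip, proj_clap = Projector(512), Projector(512)
optimizer = torch.optim.Adam(
    list(proj_clip.parameters()) + list(proj_clap.parameters()), lr=1e-4
)

# One training step: align the two projected text spaces (inter-MCR alignment).
optimizer.zero_grad()
loss = info_nce(proj_clip(clip_txt), proj_clap(clap_txt))
loss.backward()
optimizer.step()

# Because images are already aligned with text inside CLIP, and audio with text
# inside CLAP, the trained projectors also connect the non-overlapping
# (image, audio) pair in the new space, without ever seeing paired data.
```

Only the lightweight projectors are trained; the frozen CLIP and CLAP encoders supply the embeddings, which is what makes the approach training-efficient.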
Pages: 16
Related Papers
50 records in total
  • [1] Zolfaghari, Mohammadreza; Zhu, Yi; Gehler, Peter; Brox, Thomas. CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 1430-1439.
  • [2] Huang, Xin; Zhang, Jiajun; Zong, Chengqing. Contrastive Adversarial Training for Multi-Modal Machine Translation. ACM Transactions on Asian and Low-Resource Language Information Processing, 2023, 22(6).
  • [3] Li, Rui; Gao, Jing. Multi-modal Contrastive Learning for Healthcare Data Analytics. 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI 2022), 2022: 120-127.
  • [4] Fang, Quan; Zhang, Xiaowei; Hu, Jun; Wu, Xian; Xu, Changsheng. Contrastive Multi-Modal Knowledge Graph Representation Learning. IEEE Transactions on Knowledge and Data Engineering, 2023, 35(9): 8983-8996.
  • [5] Liu, Zhuang; Ma, Yunpu; Schubert, Matthias; Ouyang, Yuanxin; Xiong, Zhang. Multi-Modal Contrastive Pre-training for Recommendation. Proceedings of the 2022 International Conference on Multimedia Retrieval (ICMR 2022), 2022: 99-108.
  • [6] Lu, Yang; Li, Qin; Zhang, Xiangdong; Gao, Quanxue. Deep contrastive representation learning for multi-modal clustering. Neurocomputing, 2024, 581.
  • [7] Shi, Zejian; Xiong, Yun; Zhang, Yao; Jiang, Zhijie; Zhao, Jinjing; Wang, Lei; Li, Shanshan. Improving Code Search with Multi-Modal Momentum Contrastive Learning. 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC), 2023: 280-291.
  • [8] Wang, Yongkang; Liu, Xuan; Huang, Feng; Xiong, Zhankun; Zhang, Wen. A Multi-Modal Contrastive Diffusion Model for Therapeutic Peptide Generation. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38, No. 1, 2024: 3-11.
  • [9] Yin, Yongjing; Zeng, Jiali; Su, Jinsong; Zhou, Chulun; Meng, Fandong; Zhou, Jie; Huang, Degen; Luo, Jiebo. Multi-modal graph contrastive encoding for neural machine translation. Artificial Intelligence, 2023, 323.
  • [10] Paul, Sneha; Patterson, Zachary; Bouguila, Nizar. CrossMoCo: Multi-modal Momentum Contrastive Learning for Point Cloud. 2023 20th Conference on Robots and Vision (CRV), 2023: 273-280.