Connecting Multi-modal Contrastive Representations

Cited by: 0
Authors
Wang, Zehan [1]
Zhao, Yang [2]
Cheng, Xize [1]
Huang, Haifeng [1]
Liu, Jiageng [1]
Tang, Li [1]
Li, Linjun [1]
Wang, Yongqi [1]
Yin, Aoxiong [1]
Zhang, Ziang [1]
Zhao, Zhou [1,3]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] ByteDance, Beijing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities. This paper proposes Connecting Multi-modal Contrastive Representations (C-MCR), a novel training-efficient method for learning MCR without paired data. Specifically, given two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project them into a new space and use data from the overlapping modality B to align the two MCRs in that space. Meanwhile, since the modality pairs (A, B) and (B, C) are already aligned within each MCR, the connection learned through the overlapping modality can also be transferred to the non-overlapping modality pair (A, C). To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completeness of embeddings across different modalities for more robust alignment. Then we utilize the inter-MCR alignment to establish the connection, and employ the intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. To demonstrate the effectiveness of C-MCR, we take the fields of audio-visual and 3D-language learning as examples. Specifically, we connect CLIP and CLAP via texts to derive audio-visual representations, and integrate CLIP and ULIP via images for 3D-language representations. Remarkably, without using any paired data, C-MCR for audio-visual learning achieves state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks. Furthermore, C-MCR for 3D-language learning attains advanced zero-shot 3D point cloud classification accuracy on ModelNet40. Our project page is available at https://c-mcr.github.io/C-MCR/
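To make the connection mechanism in the abstract concrete, below is a minimal PyTorch sketch of the core idea: two small projectors map CLIP-text and CLAP-text embeddings of the same captions into a new shared space, where a symmetric InfoNCE loss aligns them. The Projector architecture, embedding dimensions, temperature, and the random placeholder tensors are illustrative assumptions, not the authors' released implementation (see the project page for that); the paper's semantic-enhancement and intra-MCR alignment terms are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Small MLP mapping one MCR's embeddings into the new shared space."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so dot products act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss pulling matched rows of a and b together."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Hypothetical pre-extracted embeddings of the SAME N captions from both text
# encoders (random tensors stand in for real CLIP/CLAP features here).
N = 32
clip_txt = torch.randn(N, 512)  # CLIP text embeddings (512-d in ViT-B variants)
clap_txt = torch.randn(N, 512)  # CLAP text embeddings (dim assumed for illustration)

proj_clip, proj_clap = Projector(512), Projector(512)
optimizer = torch.optim.Adam(
    list(proj_clip.parameters()) + list(proj_clap.parameters()), lr=1e-4
)

# One training step: align the two projected text spaces (inter-MCR alignment).
optimizer.zero_grad()
loss = info_nce(proj_clip(clip_txt), proj_clap(clap_txt))
loss.backward()
optimizer.step()

# Because images are already aligned with text inside CLIP, and audio with text
# inside CLAP, the trained projectors also connect the non-overlapping
# (image, audio) pair in the new space, without ever seeing paired data.
```

Only the lightweight projectors are trained; the frozen CLIP and CLAP encoders supply the embeddings, which is what makes the approach training-efficient.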
Pages: 16
Related Papers
50 records in total
  • [1] Zolfaghari, Mohammadreza; Zhu, Yi; Gehler, Peter; Brox, Thomas. CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 1430-1439.
  • [2] Huang, Xin; Zhang, Jiajun; Zong, Chengqing. Contrastive Adversarial Training for Multi-Modal Machine Translation. ACM Transactions on Asian and Low-Resource Language Information Processing, 2023, 22(6).
  • [3] Li, Rui; Gao, Jing. Multi-modal Contrastive Learning for Healthcare Data Analytics. 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI 2022), 2022: 120-127.
  • [4] Fang, Quan; Zhang, Xiaowei; Hu, Jun; Wu, Xian; Xu, Changsheng. Contrastive Multi-Modal Knowledge Graph Representation Learning. IEEE Transactions on Knowledge and Data Engineering, 2023, 35(9): 8983-8996.
  • [5] Liu, Zhuang; Ma, Yunpu; Schubert, Matthias; Ouyang, Yuanxin; Xiong, Zhang. Multi-Modal Contrastive Pre-training for Recommendation. Proceedings of the 2022 International Conference on Multimedia Retrieval (ICMR 2022), 2022: 99-108.
  • [6] Lu, Yang; Li, Qin; Zhang, Xiangdong; Gao, Quanxue. Deep contrastive representation learning for multi-modal clustering. Neurocomputing, 2024, 581.
  • [7] Shi, Zejian; Xiong, Yun; Zhang, Yao; Jiang, Zhijie; Zhao, Jinjing; Wang, Lei; Li, Shanshan. Improving Code Search with Multi-Modal Momentum Contrastive Learning. 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC), 2023: 280-291.
  • [8] Wang, Yongkang; Liu, Xuan; Huang, Feng; Xiong, Zhankun; Zhang, Wen. A Multi-Modal Contrastive Diffusion Model for Therapeutic Peptide Generation. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38, No. 1, 2024: 3-11.
  • [9] Yin, Yongjing; Zeng, Jiali; Su, Jinsong; Zhou, Chulun; Meng, Fandong; Zhou, Jie; Huang, Degen; Luo, Jiebo. Multi-modal graph contrastive encoding for neural machine translation. Artificial Intelligence, 2023, 323.
  • [10] Paul, Sneha; Patterson, Zachary; Bouguila, Nizar. CrossMoCo: Multi-modal Momentum Contrastive Learning for Point Cloud. 2023 20th Conference on Robots and Vision (CRV), 2023: 273-280.