Learning from the global view: Supervised contrastive learning of multimodal representation

Cited by: 9
Authors
Mai, Sijie [1 ]
Zeng, Ying [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Guangdong, Peoples R China
Keywords
Multimodal sentiment analysis; Multimodal representation learning; Contrastive learning; Multimodal humor detection; FUSION;
DOI
10.1016/j.inffus.2023.101920
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes (China)
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The development of technology enables the availability of abundant multimodal data, which can be utilized in many representation learning tasks. However, most methods ignore the rich modality correlation information stored in each multimodal object and fail to fully exploit the potential of multimodal data. To address this issue, cross-modal contrastive learning methods have been proposed to learn the similarity score of each modality pair in a self-/weakly-supervised manner and improve model robustness. Though effective, contrastive learning based on unimodal representations can be inaccurate in some cases, as unimodal representations fail to reveal the global information of multimodal objects. To this end, we propose a contrastive learning pipeline based on multimodal representations to learn from the global view, and devise multiple techniques to generate negative and positive samples for each anchor. To generate positive samples, we apply the mix-up operation to mix two multimodal representations of different objects that have the maximal label similarity. Moreover, we devise a permutation-invariant fusion mechanism that defines positive samples by permuting the input order of modalities for fusion and by sampling various contrastive fusion networks. In this way, we force the multimodal representation to be invariant to the order of modalities and the structures of fusion networks, so that the model can capture high-level semantic information of multimodal objects. To define negative samples, for each modality, we randomly replace the unimodal representation with that from another dissimilar object when synthesizing the multimodal representation. By this means, the model is led to capture the high-level concurrence information and correspondence relationships between modalities within each object.
We also directly define the multimodal representation from another object as a negative sample, where the chosen object shares the minimal label similarity with the anchor. The label information is leveraged in the proposed framework to learn a more discriminative multimodal embedding space for downstream tasks. Extensive experiments demonstrate that our method outperforms previous state-of-the-art baselines on the tasks of multimodal sentiment analysis and humor detection.
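The sampling strategies described above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the paper's implementation): `fuse` stands in for the fusion network, here a simple element-wise sum chosen only because summation is permutation-invariant, and the mixing coefficient `lam` is an assumed parameter.

```python
# Hypothetical sketch of the positive/negative sample construction
# described in the abstract. Representations are plain Python lists
# of floats; `fuse` is a stand-in for the paper's fusion network.

def fuse(modalities):
    # Toy permutation-invariant fusion: element-wise sum is
    # independent of the input order of modalities.
    dim = len(modalities[0])
    return [sum(m[i] for m in modalities) for i in range(dim)]

def mixup_positive(anchor_mm, partner_mm, lam=0.5):
    # Positive sample: mix the anchor's multimodal representation with
    # that of the object sharing the maximal label similarity.
    return [lam * a + (1.0 - lam) * p for a, p in zip(anchor_mm, partner_mm)]

def modality_replaced_negative(anchor_mods, dissimilar_mods, k):
    # Negative sample: replace modality k of the anchor with the same
    # modality taken from a dissimilar object, then fuse.
    mods = list(anchor_mods)
    mods[k] = dissimilar_mods[k]
    return fuse(mods)
```

Because `fuse` is order-invariant, feeding the modalities in any permutation yields the same multimodal representation, which is the property the permutation-invariant fusion mechanism enforces.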
Pages: 14
Related Papers
50 records in total (first 10 shown)
  • [1] Geometric Multimodal Contrastive Representation Learning
    Poklukar, Petra
    Vasco, Miguel
    Yin, Hang
    Melo, Francisco S.
    Paiva, Ana
    Kragic, Danica
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [2] Supervised contrastive learning for graph representation enhancement
    Ghayekhloo, Mohadeseh
    Nickabadi, Ahmad
    NEUROCOMPUTING, 2024, 588
  • [3] Contrastive Supervised Distillation for Continual Representation Learning
    Barletti, Tommaso
    Biondi, Niccolo
    Pernici, Federico
    Bruni, Matteo
    Del Bimbo, Alberto
    IMAGE ANALYSIS AND PROCESSING, ICIAP 2022, PT I, 2022, 13231 : 597 - 609
  • [4] Deep contrastive representation learning for supervised tasks
    Duan, Chenguang
    Jiao, Yuling
    Kang, Lican
    Yang, Jerry Zhijian
    Zhou, Fusheng
    PATTERN RECOGNITION, 2025, 161
  • [5] Multimodal Contrastive Training for Visual Representation Learning
    Yuan, Xin
    Lin, Zhe
    Kuen, Jason
    Zhang, Jianming
    Wang, Yilin
    Maire, Michael
    Kale, Ajinkya
    Faieta, Baldo
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 6991 - 7000
  • [6] Semi-Supervised Multimodal Representation Learning Through a Global Workspace
    Devillers, Benjamin
    Maytie, Leopold
    VanRullen, Rufin
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024,
  • [7] Cross-View Temporal Contrastive Learning for Self-Supervised Video Representation
    Wang, Lulu
    Xu, Zengmin
    Zhang, Xuelian
    Meng, Ruxing
    Lu, Tao
    Computer Engineering and Applications, 2024, 60 (18) : 158 - 166
  • [8] Robust Representation Learning for Multimodal Emotion Recognition with Contrastive Learning and Mixup
    Car, Yunrui
    Ye, Runchuan
    Xie, Jingran
    Zhou, Yixuan
    Xu, Yaoxun
    Wu, Zhiyong
    PROCEEDINGS OF THE 2ND INTERNATIONAL WORKSHOP ON MULTIMODAL AND RESPONSIBLE AFFECTIVE COMPUTING, MRAC 2024, 2024, : 93 - 97
  • [9] Supervised Contrastive Learning for Detecting Anomalous Driving Behaviours from Multimodal Videos
    Khan, Shehroz S.
    Shen, Ziting
    Sun, Haoying
    Patel, Ax
    Abedi, Ali
    2022 19TH CONFERENCE ON ROBOTS AND VISION (CRV 2022), 2022, : 16 - 23
  • [10] Multimodal Supervised Contrastive Learning in Remote Sensing Downstream Tasks
    Berg, Paul
    Uzun, Baki
    Pham, Minh-Tan
    Courty, Nicolas
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5