Learning from the global view: Supervised contrastive learning of multimodal representation

Cited: 9
Authors
Mai, Sijie [1 ]
Zeng, Ying [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Guangdong, Peoples R China
Keywords
Multimodal sentiment analysis; Multimodal representation learning; Contrastive learning; Multimodal humor detection; FUSION;
DOI
10.1016/j.inffus.2023.101920
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Technological development has made abundant multimodal data available for many representation learning tasks. However, most methods ignore the rich modality-correlation information stored in each multimodal object and fail to fully exploit the potential of multimodal data. To address this issue, cross-modal contrastive learning methods have been proposed to learn the similarity score of each modality pair in a self- or weakly-supervised manner and to improve model robustness. Though effective, contrastive learning based on unimodal representations can be inaccurate, as unimodal representations fail to reveal the global information of multimodal objects. To this end, we propose a contrastive learning pipeline based on multimodal representations to learn from the global view, and devise multiple techniques to generate positive and negative samples for each anchor. To generate positive samples, we apply a mix-up operation to the multimodal representations of two different objects that have the maximal label similarity. Moreover, we devise a permutation-invariant fusion mechanism that defines additional positive samples by permuting the input order of modalities for fusion and by sampling various contrastive fusion networks. In this way, we force the multimodal representation to be invariant to the order of modalities and to the structure of the fusion network, so that the model captures the high-level semantic information of multimodal objects. To define negative samples, for each modality we randomly replace the unimodal representation with that of another, dissimilar object when synthesizing the multimodal representation. This leads the model to capture the high-level co-occurrence information and correspondence relationships between modalities within each object. We also directly define the multimodal representation of another object as a negative sample, where the chosen object shares the minimal label similarity with the anchor. The label information is leveraged in the proposed framework to learn a more discriminative multimodal embedding space for downstream tasks. Extensive experiments demonstrate that our method outperforms previous state-of-the-art baselines on the tasks of multimodal sentiment analysis and humor detection.
Pages: 14
Related papers
50 records in total
  • [21] CONTRASTIVE SEPARATIVE CODING FOR SELF-SUPERVISED REPRESENTATION LEARNING
    Wang, Jun
    Lam, Max W. Y.
    Su, Dan
    Yu, Dong
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3865 - 3869
  • [22] Grouped Contrastive Learning of Self-Supervised Sentence Representation
    Wang, Qian
    Zhang, Weiqi
    Lei, Tianyi
    Peng, Dezhong
    APPLIED SCIENCES-BASEL, 2023, 13 (17)
  • [23] Learning From Crowds With Contrastive Representation
    Yang, Hang
    Li, Xunbo
    Pedrycz, Witold
    IEEE ACCESS, 2023, 11 : 40182 - 40191
  • [24] Extracting Multimodal Embeddings via Supervised Contrastive Learning for Psychological Screening
    Kalanadhabhatta, Manasa
    Santana, Adrelys Mateo
    Ganesan, Deepak
    Rahman, Tauhidur
    Grabell, Adam
    2022 10TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2022
  • [25] Self-supervised Graph-level Representation Learning with Adversarial Contrastive Learning
    Luo, Xiao
    Ju, Wei
    Gu, Yiyang
    Mao, Zhengyang
    Liu, Luchen
    Yuan, Yuhui
    Zhang, Ming
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2024, 18 (02)
  • [26] TimesURL: Self-Supervised Contrastive Learning for Universal Time Series Representation Learning
    Liu, Jiexi
    Chen, Songcan
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 12, 2024, : 13918 - 13926
  • [27] Mixing up contrastive learning: Self-supervised representation learning for time series
    Wickstrøm, Kristoffer
    Kampffmeyer, Michael
    Mikalsen, Karl Øyvind
    Jenssen, Robert
    PATTERN RECOGNITION LETTERS, 2022, 155 : 54 - 61
  • [28] Connectivity-Contrastive Learning: Combining Causal Discovery and Representation Learning for Multimodal Data
    Morioka, Hiroshi
    Hyvärinen, Aapo
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 206, 2023, 206
  • [29] Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training
    Dave, Vedant
    Lygerakis, Fotios
    Rueckert, Elmar
    2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA 2024, 2024, : 8013 - 8020
  • [30] Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based Contrastive Learning for Enhanced Fusion Representation
    Nguyen, Cong-Duy
    Nguyen, Thong
    Vu, Duc Anh
    Tuan, Luu Anh
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14714 - 14724