Self-Supervised Correlation Learning for Cross-Modal Retrieval

Cited by: 24
Authors
Liu, Yaxin [1 ]
Wu, Jianlong [1 ]
Qu, Leigang [1 ]
Gan, Tian [1 ]
Yin, Jianhua [1 ]
Nie, Liqiang [1 ]
Affiliations
[1] Shandong Univ, Sch Comp Sci & Technol, Qingdao 266237, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; self-supervised contrastive learning; mutual information estimation;
DOI
10.1109/TMM.2022.3152086
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Cross-modal retrieval aims to retrieve relevant data from one modality given a query from another. Although most existing methods rely on the label information of multimedia data and have achieved promising results, this performance comes at a high cost: labeling data demands enormous human effort, especially on large-scale multimedia datasets. Unsupervised cross-modal learning is therefore of crucial importance in real-world applications. In this paper, we propose a novel unsupervised cross-modal retrieval method, named Self-supervised Correlation Learning (SCL), which takes full advantage of large amounts of unlabeled data to learn discriminative and modality-invariant representations. Since unsupervised learning lacks the supervision of category labels, we incorporate knowledge from the input as a supervisory signal by maximizing the mutual information between the input and the output of each modality-specific projector. To learn discriminative representations, we exploit unsupervised contrastive learning to model the relationships among intra- and inter-modality instances, pulling similar samples closer and pushing dissimilar samples apart. To further close the modality gap, we use a weight-sharing scheme and minimize a modality-invariant loss in the joint representation space. We also extend the proposed method to the semi-supervised setting. Extensive experiments on three widely used benchmark datasets demonstrate that our method achieves competitive results compared with current state-of-the-art cross-modal retrieval approaches.
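As a rough illustration of the inter-modality contrastive objective the abstract describes, the following PyTorch sketch implements a symmetric InfoNCE-style cross-modal loss. It is not the authors' published code: the function name cross_modal_contrastive_loss, the temperature of 0.1, the feature dimension of 256, and the convention that the i-th image and i-th text in a batch form the positive pair are all assumptions made here for clarity.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.1):
        # L2-normalize so dot products become cosine similarities.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        # Similarity between every image and every text in the batch (B x B).
        logits = img_emb @ txt_emb.t() / temperature
        # Assumed pairing: the i-th image and i-th text are the positive pair,
        # so diagonal entries are positives and all off-diagonals are negatives.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE over both retrieval directions.
        loss_i2t = F.cross_entropy(logits, targets)   # image-to-text
        loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image
        return 0.5 * (loss_i2t + loss_t2i)

    # Example usage with random features standing in for projector outputs:
    img = torch.randn(32, 256)  # image projector output, batch of 32
    txt = torch.randn(32, 256)  # text projector output, paired with the images
    loss = cross_modal_contrastive_loss(img, txt)

The symmetric form covers both retrieval directions (image-to-text and text-to-image). In the full method sketched by the abstract, the intra-modality contrastive terms, the mutual-information estimation between projector input and output, and the weight-sharing modality-invariant loss would be additional components alongside this inter-modality term.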
Pages: 2851-2863
Page count: 13
Related Papers
50 records in total
  • [21] Graph Convolutional Network Semantic Enhancement Hashing for Self-supervised Cross-Modal Retrieval
    Hu, Jinyu
    Li, Mingyong
    Zhang, Jiayan
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT IV, 2023, 14257 : 410 - 422
  • [22] Learning Mutual Modulation for Self-supervised Cross-Modal Super-Resolution
    Dong, Xiaoyu
    Yokoya, Naoto
    Wang, Longguang
    Uezato, Tatsumi
    COMPUTER VISION, ECCV 2022, PT XIX, 2022, 13679 : 1 - 18
  • [23] Self-Supervised Intra-Modal and Cross-Modal Contrastive Learning for Point Cloud Understanding
    Wu, Yue
    Liu, Jiaming
    Gong, Maoguo
    Gong, Peiran
    Fan, Xiaolong
    Qin, A. K.
    Miao, Qiguang
    Ma, Wenping
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1626 - 1638
  • [24] Deep Self-Supervised Hashing With Fine-Grained Similarity Mining for Cross-Modal Retrieval
    Han, Lijun
    Wang, Renlin
    Chen, Chunlei
    Zhang, Huihui
    Zhang, Yujie
    Zhang, Wenfeng
    IEEE ACCESS, 2024, 12 : 31756 - 31770
  • [25] Self-Supervised Cross-Modal Distillation for Thermal Infrared Tracking
    Zha, Yufei
    Sun, Jingxian
    Zhang, Peng
    Zhang, Lichao
    Gonzalez-Garcia, Abel
    Huang, Wei
    IEEE MULTIMEDIA, 2022, 29 (04) : 80 - 96
  • [26] Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery
    Wu, Jie Ying
    Tamhane, Aniruddha
    Kazanzides, Peter
    Unberath, Mathias
    INTERNATIONAL JOURNAL OF COMPUTER ASSISTED RADIOLOGY AND SURGERY, 2021, 16 (05) : 779 - 787
  • [28] Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
    Sarkar, Pritam
    Etemad, Ali
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9723 - 9732
  • [29] Trusted 3D self-supervised representation learning with cross-modal settings
    Han, Xu
    Cheng, Haozhe
    Shi, Pengcheng
    Zhu, Jihua
    MACHINE VISION AND APPLICATIONS, 2024, 35 (04)
  • [30] Deep Supervised Cross-modal Retrieval
    Zhen, Liangli
    Hu, Peng
    Wang, Xu
    Peng, Dezhong
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10386 - 10395