Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Cited by: 0
Authors
Jin, Peng [1, 3]
Huang, Jinfa [1, 3]
Liu, Fenglin [4]
Wu, Xian [5]
Ge, Shen [5]
Song, Guoli [2]
Clifton, David A. [4, 6]
Chen, Jie [1, 2, 3]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Peking Univ, AI Sci AI4S Preferred Program, Shenzhen Grad Sch, Beijing, Peoples R China
[4] Univ Oxford, Dept Engn Sci, Oxford, England
[5] Tencent JARVIS Lab, Shenzhen, Peoples R China
[6] Oxford Suzhou Ctr Adv Res, Suzhou, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP [53], to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, the shared latent spaces learned this way are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, so that the features can be concisely represented as linear combinations of these bases. This decomposition of video-and-language features reduces the rank of the latent space, which increases its representational power for semantics. Extensive experiments on three benchmark text-video retrieval datasets show that EMCL learns more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can boost the performance of existing approaches either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into existing methods.
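The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The softmax responsibilities, the temperature, the random initialisation, the weighted-mean update, the stacking of video and text features into one matrix, and the names `responsibilities` and `em_bases` are all assumptions made for illustration. It only shows how an EM-style loop can estimate a small set of bases and rewrite every feature as a combination of them.

```python
import numpy as np


def responsibilities(features, bases, temperature=1.0):
    """E-step: softmax over feature-to-basis dot products (assumed similarity)."""
    logits = features @ bases.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def em_bases(features, num_bases=4, num_iters=10, temperature=1.0, seed=0):
    """Hypothetical helper: alternate E- and M-steps to estimate the bases."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the bases from randomly chosen feature vectors.
    index = rng.choice(features.shape[0], size=num_bases, replace=False)
    bases = features[index].copy()
    for _ in range(num_iters):
        resp = responsibilities(features, bases, temperature)
        # M-step: each basis becomes the responsibility-weighted mean feature.
        column_weights = resp / (resp.sum(axis=0, keepdims=True) + 1e-8)
        bases = column_weights.T @ features
    return bases, responsibilities(features, bases, temperature)


# Toy data: 16 "video" and 16 "text" feature vectors in a shared 8-d space.
rng = np.random.default_rng(0)
video = rng.normal(size=(16, 8))
text = rng.normal(size=(16, 8))
features = np.concatenate([video, text], axis=0)

bases, resp = em_bases(features, num_bases=4)
# Each reconstructed feature is a convex combination of the shared bases,
# so the reconstruction matrix has rank at most num_bases.
reconstructed = resp @ bases
print(reconstructed.shape, np.linalg.matrix_rank(reconstructed))
```

Because both modalities are reconstructed from the same four bases in this toy setup, the reconstructed features lie in a subspace of dimension at most four, which is the rank reduction the abstract refers to.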
Pages: 16