Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Cited by: 0

Authors:

Jin, Peng [1,3]

Huang, Jinfa [1,3]

Liu, Fenglin [4]

Wu, Xian [5]

Ge, Shen [5]

Song, Guoli [2]

Clifton, David A. [4,6]

Chen, Jie [1,2,3]

Affiliations:

[1] Peking Univ, Sch Elect & Comp Engn, Beijing, Peoples R China

[2] Peng Cheng Lab, Shenzhen, Peoples R China

[3] Peking Univ, AI Sci AI4S Preferred Program, Shenzhen Grad Sch, Beijing, Peoples R China

[4] Univ Oxford, Dept Engn Sci, Oxford, England

[5] Tencent JARVIS Lab, Shenzhen, Peoples R China

[6] Oxford Suzhou Ctr Adv Res, Suzhou, Peoples R China

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022) | 2022

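Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract: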

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP [53], to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, so that the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, which increases its representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets show that EMCL learns more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches, either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into existing methods.
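
To make the idea of "a compact set of bases" more concrete, the following is a minimal NumPy sketch of an EM-style decomposition: the E-step softly assigns each feature to a small number of bases, the M-step re-estimates each basis as a responsibility-weighted mean, and the features are then rebuilt as linear combinations of the bases. It is an illustration only, not the authors' implementation. The function name `em_reconstruct`, the softmax kernel, the `temperature` parameter, the number of iterations, and the random initialisation are all assumptions that do not come from this record.

```python
import numpy as np

def em_reconstruct(features, num_bases=4, num_iters=5, temperature=1.0, seed=0):
    """Hypothetical EM-style low-rank reconstruction of a feature matrix.

    features: array of shape (n, d), e.g. stacked video and text features.
    Returns (reconstruction, bases, responsibilities).
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    # Assumed initialisation: random unit-norm bases.
    bases = rng.standard_normal((num_bases, d))
    bases /= np.linalg.norm(bases, axis=1, keepdims=True)

    for _ in range(num_iters):
        # E-step: soft assignment of each feature to each basis
        # (softmax over bases of the scaled dot product).
        logits = features @ bases.T / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: each basis becomes the responsibility-weighted mean
        # of the features assigned to it.
        bases = (resp.T @ features) / (resp.sum(axis=0)[:, None] + 1e-8)

    # Each reconstructed feature is a linear combination of the bases,
    # so the reconstruction has rank at most num_bases.
    reconstruction = resp @ bases
    return reconstruction, bases, resp


# Usage on random data (illustrative only).
feats = np.random.default_rng(1).standard_normal((32, 16))
recon, bases, resp = em_reconstruct(feats, num_bases=4)
print(recon.shape, np.linalg.matrix_rank(recon))
```

Because the reconstruction is the product of an (n, num_bases) matrix and a (num_bases, d) matrix, its rank cannot exceed `num_bases`, which is the sense in which such a decomposition yields a lower-rank, more compact feature space.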


Pages: 16
