Contrastive Learning of Global-Local Video Representations

Cited by: 0
Authors
Ma, Shuang [1]
Zeng, Zhaoyang [2]
McDuff, Daniel [3]
Song, Yale [3]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] Sun Yat Sen Univ, Guangzhou, Peoples R China
[3] Microsoft Res, Redmond, WA USA
Keywords
NETWORK; SOUND;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Contrastive learning has delivered impressive results for various tasks in the self-supervised regime. However, existing approaches optimize for learning representations specific to downstream scenarios, i.e., global representations suitable for tasks such as classification or local representations for tasks such as detection and localization. While they produce satisfactory results in the intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two contrastive objectives that together encourage our model to learn global-local visual information given audio signals. We show that the two objectives mutually improve the generalizability of the learned global-local representations, significantly outperforming their disjointly learned counterparts. We demonstrate our approach on various tasks including action/sound classification, lip reading, deepfake detection, and event and sound localization.
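The idea of optimizing two contrastive objectives jointly can be illustrated with a minimal sketch. It assumes a generic InfoNCE loss applied once to global (clip-level) video/audio embeddings and once to local (fine-grained) ones; the abstract does not give the exact losses, so this is not the authors' formulation. All names here (`info_nce`, `global_local_loss`, `temperature`, `weight`) are hypothetical and chosen for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss; queries[i] and keys[i] are the positive pair,
    and every other key in the batch acts as a negative."""
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def global_local_loss(video_global, audio_global,
                      video_local, audio_local, weight=1.0):
    """Assumed combination: a weighted sum of a global and a local
    video-audio contrastive term (hypothetical, for illustration)."""
    return (info_nce(video_global, audio_global)
            + weight * info_nce(video_local, audio_local))
```

In this sketch, embeddings whose video and audio sides are aligned yield a loss near zero, while mismatched pairs are penalized. The single `weight` parameter is an assumed way of balancing the two terms.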
Pages: 16
Related papers
50 in total
  • [1] A GLOBAL-LOCAL CONTRASTIVE LEARNING FRAMEWORK FOR VIDEO CAPTIONING
    Huang, Qunyue
    Fang, Bin
    Ai, Xi
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2410 - 2414
  • [2] Violent Video Recognition Based on Global-Local Visual and Audio Contrastive Learning
    Liu, Zihao
    Wu, Xiaoyu
    Wang, Shengjin
    Shang, Yimeng
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 476 - 480
  • [3] Global-Local Temporal Representations For Video Person Re-Identification
    Li, Jianing
    Wang, Jingdong
    Tian, Qi
    Gao, Wen
    Zhang, Shiliang
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 3957 - 3966
  • [4] Global-local contrastive multiview representation learning for skeleton-based action
    Bian, Cunling
    Feng, Wei
    Meng, Fanbo
    Wang, Song
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 229
  • [5] Probabilistic Representations for Video Contrastive Learning
    Park, Jungin
    Lee, Jiyoung
    Kim, Ig-Jae
    Sohn, Kwanghoon
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 14691 - 14701
  • [6] Video Captioning Using Global-Local Representation
    Yan, Liqi
    Ma, Siqi
    Wang, Qifan
    Chen, Yingjie
    Zhang, Xiangyu
    Savakis, Andreas
    Liu, Dongfang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (10) : 6642 - 6656
  • [7] Video Contrastive Learning with Global Context
    Kuang, Haofei
    Zhu, Yi
    Zhang, Zhi
    Li, Xinyu
    Tighe, Joseph
    Schwertfeger, Soeren
    Stachniss, Cyrill
    Li, Mu
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3188 - 3197
  • [8] Video representation learning for temporal action detection using global-local attention
    Tang, Yiping
    Zheng, Yang
    Wei, Chen
    Guo, Kaitai
    Hu, Haihong
    Liang, Jimin
    PATTERN RECOGNITION, 2022, 134
  • [9] Extended Global-Local Representation Learning for Video Person Re-Identification
    Song, Wanru
    Wu, Yahong
    Zheng, Jieying
    Chen, Changhong
    Liu, Feng
    IEEE ACCESS, 2019, 7 : 122684 - 122696
  • [10] Motion-Focused Contrastive Learning of Video Representations
    Li, Rui
    Zhang, Yiheng
    Qiu, Zhaofan
    Yao, Ting
    Liu, Dong
    Mei, Tao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2085 - 2094