Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency

Cited by: 6
Authors
Qing, Zhiwu [1]
Zhang, Shiwei [2]
Huang, Ziyuan [3]
Xu, Yi [4]
Wang, Xiang [1]
Tang, Mingqian [2]
Gao, Changxin [1]
Jin, Rong [2]
Sang, Nong [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Key Lab Image Proc & Intelligent Control, Wuhan, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Natl Univ Singapore, ARC, Singapore, Singapore
[4] Dalian Univ Technol, Dalian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/CVPR52688.2022.01345
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Natural videos provide rich visual content for self-supervised learning. Yet most existing approaches for learning spatio-temporal representations rely on manually trimmed videos, leading to limited diversity in visual patterns and limited performance gain. In this work, we aim to learn representations by leveraging the more abundant information in untrimmed videos. To this end, we propose to learn a hierarchy of consistencies in videos, i.e., visual consistency and topical consistency, corresponding respectively to clip pairs that tend to be visually similar when separated by a short time span and to share similar topics when separated by a long time span. Specifically, a hierarchical consistency learning framework, HiCo, is presented, in which the visually consistent pairs are encouraged to have the same representation through contrastive learning, while the topically consistent pairs are coupled through a topical classifier that distinguishes whether they are topic-related. Further, we impose a gradual sampling algorithm for the proposed hierarchical consistency learning and demonstrate its theoretical superiority. Empirically, we show that HiCo not only generates stronger representations on untrimmed videos, but also improves the representation quality when applied to trimmed videos. This is in contrast to standard contrastive learning, which fails to learn appropriate representations from untrimmed videos.
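The two kinds of clip pairs described in the abstract can be illustrated with a small sampling sketch. This is not the authors' implementation: the abstract does not specify how the gradual sampling works, so the linear schedule in `max_span`, the function names, and all parameters (`short_span`, `progress`, `clip_len`) are hypothetical choices made only for illustration. The sketch draws one short-span pair (a candidate for visual consistency) and one pair whose allowed time span grows as training progresses (a candidate for topical consistency).

```python
import random


def max_span(progress, short_span, video_len):
    """Hypothetical linear schedule (an assumption, not from the paper):
    the allowed distance between two clips grows from short_span to
    video_len as training progress goes from 0 to 1."""
    progress = min(max(progress, 0.0), 1.0)
    return short_span + progress * (video_len - short_span)


def sample_pair(video_len, clip_len, span, rng):
    """Sample start times (in seconds) of two clips from one video such
    that the two starts are at most `span` seconds apart."""
    last_start = video_len - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than the clip length")
    first = rng.uniform(0.0, last_start)
    # Restrict the second clip to a window around the first one.
    lo = max(0.0, first - span)
    hi = min(last_start, first + span)
    second = rng.uniform(lo, hi)
    return first, second


def sample_training_pairs(video_len, clip_len, short_span, progress, rng):
    """Return (visual_pair, topical_pair) for a single untrimmed video."""
    visual = sample_pair(video_len, clip_len, short_span, rng)
    topical_span = max_span(progress, short_span, video_len)
    topical = sample_pair(video_len, clip_len, topical_span, rng)
    return visual, topical


if __name__ == "__main__":
    rng = random.Random(0)
    visual, topical = sample_training_pairs(
        video_len=120.0, clip_len=2.0, short_span=1.0, progress=0.5, rng=rng
    )
    print("visually consistent pair:", visual)
    print("topically consistent pair:", topical)
```

In a full pipeline, the short-span pair would feed a contrastive loss and the long-span pair a binary topical classifier, with clips from other videos serving as unrelated examples; those components are omitted here because the abstract gives no further detail.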
Pages: 13811-13821
Page count: 11
Related papers
50 in total
  • [41] Self-Supervised Representation Learning from Flow Equivariance
    Xiong, Yuwen
    Ren, Mengye
    Zeng, Wenyuan
    Urtasun, Raquel
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 10171 - 10180
  • [42] Enhancing motion visual cues for self-supervised video representation learning
    Nie, Mu
    Quan, Zhibin
    Ding, Weiping
    Yang, Wankou
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [43] TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning
    Liu, Yang
    Wang, Keze
    Liu, Lingbo
    Lan, Haoyuan
    Lin, Liang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1978 - 1993
  • [44] Dynamic-boosting attention for self-supervised video representation learning
    Zhipeng Wang
    Chunping Hou
    Guanghui Yue
    Qingyuan Yang
    [J]. Applied Intelligence, 2022, 52 : 3143 - 3155
  • [45] Self-Supervised Learning of Video Representation for Anticipating Actions in Early Stage
    Liu, Yinan
    Wu, Qingbo
    Tang, Liangzhi
    Xu, Linfeng
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2018, E101D (05): : 1449 - 1452
  • [46] Self-Supervised Video Representation Learning by Serial Restoration With Elastic Complexity
    Chen, Ziyu
    Wang, Hanli
    Chen, Chang Wen
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 2235 - 2248
  • [47] Self-supervised learning of class embeddings from video
    Wiles, Olivia
    Koepke, A. Sophia
    Zisserman, Andrew
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 3019 - 3027
  • [48] Dynamic-boosting attention for self-supervised video representation learning
    Wang, Zhipeng
    Hou, Chunping
    Yue, Guanghui
    Yang, Qingyuan
    [J]. APPLIED INTELLIGENCE, 2022, 52 (03) : 3143 - 3155
  • [49] Self-Supervised Video Representation Learning with Meta-Contrastive Network
    Lin, Yuanze
    Guo, Xun
    Lu, Yan
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 8219 - 8229
  • [50] Self-supervised Object-Centric Learning for Videos
    Aydemir, Gorkay
    Xie, Weidi
    Guney, Fatma
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,