Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency

Cited by: 6
Authors
Qing, Zhiwu [1]
Zhang, Shiwei [2]
Huang, Ziyuan [3]
Xu, Yi [4]
Wang, Xiang [1]
Tang, Mingqian [2]
Gao, Changxin [1]
Jin, Rong [2]
Sang, Nong [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Artificial Intelligence & Automat, Key Lab Image Proc & Intelligent Control, Wuhan, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Natl Univ Singapore, ARC, Singapore, Singapore
[4] Dalian Univ Technol, Dalian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/CVPR52688.2022.01345
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Natural videos provide rich visual content for self-supervised learning. Yet most existing approaches for learning spatio-temporal representations rely on manually trimmed videos, leading to limited diversity in visual patterns and limited performance gain. In this work, we aim to learn representations by leveraging the more abundant information in untrimmed videos. To this end, we propose to learn a hierarchy of consistencies in videos, i.e., visual consistency and topical consistency, corresponding respectively to clip pairs that tend to be visually similar when separated by a short time span and to share similar topics when separated by a long time span. Specifically, a hierarchical consistency learning framework, HiCo, is presented, where the visually consistent pairs are encouraged to have the same representation through contrastive learning, while the topically consistent pairs are coupled through a topical classifier that distinguishes whether they are topic-related. Further, we impose a gradual sampling algorithm for the proposed hierarchical consistency learning, and demonstrate its theoretical superiority. Empirically, we show that HiCo not only generates stronger representations on untrimmed videos, but also improves the representation quality when applied to trimmed videos. This is in contrast to standard contrastive learning, which fails to learn appropriate representations from untrimmed videos.
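The two consistency levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the InfoNCE form of the contrastive loss, the binary cross-entropy form of the topical loss, and the linear schedule used for gradual sampling are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def gradual_max_gap(epoch, total_epochs, short_gap, long_gap):
    """Assumed linear schedule: the allowed temporal gap grows over training."""
    if total_epochs <= 1:
        return long_gap
    frac = epoch / (total_epochs - 1)
    return int(round(short_gap + (long_gap - short_gap) * frac))


def sample_clip_starts(video_len, clip_len, max_gap, rng):
    """Sample two clip start frames from one video, at most max_gap apart."""
    last_start = video_len - clip_len
    first = rng.randrange(0, last_start + 1)
    lo = max(0, first - max_gap)
    hi = min(last_start, first + max_gap)
    second = rng.randint(lo, hi)  # inclusive on both ends
    return first, second


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchors, positives, temperature=0.1):
    """Visual consistency (assumed InfoNCE form): positives[i] matches
    anchors[i]; every other entry in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def topical_bce(pair_logits, labels):
    """Topical consistency (assumed binary cross-entropy on raw logits):
    label 1 = clips from the same video, 0 = clips from different videos."""
    total = 0.0
    for score, label in zip(pair_logits, labels):
        # Numerically stable form of BCE-with-logits.
        total += max(score, 0.0) - score * label + math.log1p(math.exp(-abs(score)))
    return total / len(pair_logits)


if __name__ == "__main__":
    rng = random.Random(0)
    gap = gradual_max_gap(epoch=3, total_epochs=10, short_gap=8, long_gap=64)
    print("max gap at epoch 3:", gap)
    print("clip starts:", sample_clip_starts(300, 16, gap, rng))
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print("visual loss:", info_nce(feats, feats))
    print("topical loss:", topical_bce([5.0, -5.0], [1, 0]))
```

In a real training loop the two losses would be computed on encoder features and combined with some weighting, and the pair logits would come from a learned classifier head; neither detail is given in the abstract, so both are left out of the sketch.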
Pages: 13811-13821
Number of pages: 11
Related papers (showing 10 of 50)
  • [1] Self-Supervised Learning from Untrimmed Videos via Hierarchical Consistency
    Qing, Zhiwu
    Zhang, Shiwei
    Huang, Ziyuan
    Xu, Yi
    Wang, Xiang
    Gao, Changxin
    Jin, Rong
    Sang, Nong
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (10) : 12408 - 12426
  • [2] Exploring Relations in Untrimmed Videos for Self-Supervised Learning
    Luo, Dezhao
    Zhou, Yu
    Fang, Bo
    Zhou, Yucan
    Wu, Dayan
    Wang, Weiping
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (01)
  • [3] ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency
    Huang, Deng
    Wu, Wenhao
    Hu, Weiwen
    Liu, Xu
    He, Dongliang
    Wu, Zhihua
    Wu, Xiangmiao
    Tan, Mingkui
    Ding, Errui
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 8076 - 8085
  • [4] SELF-SUPERVISED REPRESENTATION LEARNING FOR ULTRASOUND VIDEO
    Jiao, Jianbo
    Droste, Richard
    Drukker, Lior
    Papageorghiou, Aris T.
    Noble, J. Alison
    [J]. 2020 IEEE 17TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2020), 2020, : 1847 - 1850
  • [5] Self-Supervised Visual Representation Learning from Hierarchical Grouping
    Zhang, Xiao
    Maire, Michael
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [6] Self-Supervised Video Representation Learning by Video Incoherence Detection
    Cao, Haozhi
    Xu, Yuecong
    Mao, Kezhi
    Xie, Lihua
    Yin, Jianxiong
    See, Simon
    Xu, Qianwen
    Yang, Jianfei
    [J]. IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (06) : 3810 - 3822
  • [7] Hierarchical Self-supervised Representation Learning for Movie Understanding
    Xiao, Fanyi
    Kundu, Kaustav
    Tighe, Joseph
    Modolo, Davide
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 9717 - 9726
  • [8] SHERLock: Self-Supervised Hierarchical Event Representation Learning
    Roychowdhury, S.
    Sontakke, S. A.
    Itti, L.
    Sarkar, M.
    Aggarwal, M.
    Badjatiya, P.
    Puri, N.
    Krishnamurthy, B.
    [J]. 2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 2672 - 2678
  • [9] Video Face Clustering with Self-Supervised Representation Learning
    Sharma, Vivek
    Tapaswi, Makarand
    Saquib Sarfraz, M.
    Stiefelhagen, Rainer
    [J]. IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE, 2020, 2 (02) : 145 - 157
  • [10] Self-Supervised Representation Learning for Video Quality Assessment
    Jiang, Shaojie
    Sang, Qingbing
    Hu, Zongyao
    Liu, Lixiong
    [J]. IEEE TRANSACTIONS ON BROADCASTING, 2023, 69 (01) : 118 - 129