Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

Cited by: 6
Authors
Zhang, Zehua [1]
Crandall, David [1]
Affiliations
[1] Indiana Univ, Bloomington, IN 47405 USA
DOI
10.1109/WACV51458.2022.00105
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it hierarchically to encourage multi-scale understanding. Motivated by their effectiveness in supervised learning, we first introduce spatial-temporal feature learning decoupling and hierarchical learning to the context of unsupervised video learning. We show by experiments that augmentations can be manipulated as regularization to guide the network to learn desired semantics in contrastive learning, and we propose a way for the model to separately capture spatial and temporal features at multiple scales. We also introduce an approach to overcome the problem of divergent levels of instance invariance at different hierarchies by modeling the invariance as loss weights for objective re-weighting. Experiments on downstream action recognition benchmarks on UCF101 and HMDB51 show that our proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) makes substantial improvements over directly learning spatial-temporal features as a whole and achieves competitive performance when compared with other state-of-the-art unsupervised methods. Code will be made available.
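The abstract describes two ideas that a small example may make more concrete: separate spatial and temporal contrastive subtasks, and per-level loss weights applied across a hierarchy. The following is a minimal sketch under stated assumptions, not the authors' implementation. It uses a standard InfoNCE loss as a stand-in for each subtask, and the names `info_nce`, `weighted_hierarchical_loss`, the `levels` structure, and the numeric weights are all hypothetical. In particular, the abstract says the weights model instance invariance but does not say how they are computed, so here they are simply passed in.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss where keys[i] is the positive for queries[i]
    and every other key in the batch serves as a negative."""
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def weighted_hierarchical_loss(levels, weights, temperature=0.1):
    """Sum spatial and temporal contrastive losses over hierarchy levels.

    levels:  list of dicts, each holding a (queries, keys) pair under the
             hypothetical keys "spatial" and "temporal".
    weights: one scalar per level (assumed given; how they are derived
             is not specified in the abstract).
    """
    if len(levels) != len(weights):
        raise ValueError("need exactly one weight per level")
    total = 0.0
    for level, w in zip(levels, weights):
        spatial = info_nce(*level["spatial"], temperature=temperature)
        temporal = info_nce(*level["temporal"], temperature=temperature)
        total += w * (spatial + temporal)
    return total


# Toy usage: two hierarchy levels, two clips per batch, 2-D embeddings.
aligned = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
levels = [
    {"spatial": aligned, "temporal": aligned},
    {"spatial": aligned, "temporal": swapped},
]
loss = weighted_hierarchical_loss(levels, weights=[1.0, 0.5])
```

Keeping the spatial and temporal terms as separate calls mirrors the decoupling described in the abstract; in a real pipeline each pair of embeddings would come from differently augmented views chosen to emphasize one kind of feature.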
Pages: 975 - 985
Number of pages: 11
Related papers
50 in total
  • [1] Attentive spatial-temporal contrastive learning for self-supervised video representation
    Yang, Xingming
    Xiong, Sixuan
    Wu, Kewei
    Shan, Dongfeng
    Xie, Zhao
    [J]. IMAGE AND VISION COMPUTING, 2023, 137
  • [2] Self-Supervised Representation Learning With Spatial-Temporal Consistency for Sign Language Recognition
    Zhao, Weichao
    Zhou, Wengang
    Hu, Hezhen
    Wang, Min
    Li, Houqiang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 4188 - 4201
  • [3] Spatial-Temporal Hypergraph Self-Supervised Learning for Crime Prediction
    Li, Zhonghang
    Huang, Chao
    Xia, Lianghao
    Xu, Yong
    Pei, Jian
    [J]. 2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022: 2984 - 2996
  • [4] SSRL: Self-Supervised Spatial-Temporal Representation Learning for 3D Action Recognition
    Jin, Zhihao
    Wang, Yifan
    Wang, Qicong
    Shen, Yehu
    Meng, Hongying
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (01) : 274 - 285
  • [5] Spatial-then-Temporal Self-Supervised Learning for Video Correspondence
    Li, Rui
    Liu, Dong
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 2279 - 2288
  • [6] TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning
    Liu, Yang
    Wang, Keze
    Liu, Lingbo
    Lan, Haoyuan
    Lin, Liang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1978 - 1993
  • [7] Spatial and temporal features unified self-supervised representation learning networks
    Choudhary, Rahul
    Walambe, Rahee
    Kotecha, Ketan
    [J]. ROBOTICS AND AUTONOMOUS SYSTEMS, 2022, 157
  • [8] SELF-SUPERVISED REPRESENTATION LEARNING FOR ULTRASOUND VIDEO
    Jiao, Jianbo
    Droste, Richard
    Drukker, Lior
    Papageorghiou, Aris T.
    Noble, J. Alison
    [J]. 2020 IEEE 17TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2020), 2020: 1847 - 1850
  • [9] Self-Supervised Dynamic Graph Representation Learning via Temporal Subgraph Contrast
    Chen, Ke-Jia
    Liu, Linsong
    Jiang, Linpu
    Chen, Jingqiang
    [J]. ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2024, 18 (01)
  • [10] Hierarchically Self-supervised Transformer for Human Skeleton Representation Learning
    Chen, Yuxiao
    Zhao, Long
    Yuan, Jianbo
    Tian, Yu
    Xia, Zhaoyang
    Geng, Shijie
    Han, Ligong
    Metaxas, Dimitris N.
    [J]. COMPUTER VISION, ECCV 2022, PT XXVI, 2022, 13686 : 185 - 202