Hierarchically Self-supervised Transformer for Human Skeleton Representation Learning

Cited by: 10
Authors
Chen, Yuxiao [1 ]
Zhao, Long [2 ]
Yuan, Jianbo [3 ]
Tian, Yu [3 ]
Xia, Zhaoyang [1 ]
Geng, Shijie [1 ]
Han, Ligong [1 ]
Metaxas, Dimitris N. [1 ]
Affiliations
[1] Rutgers State Univ, Piscataway, NJ 08854 USA
[2] Google Res, Los Angeles, CA USA
[3] ByteDance Inc, Seattle, WA USA
Keywords
Skeleton representation learning; Self-supervised learning; Action recognition; Action detection; Motion prediction
DOI
10.1007/978-3-031-19809-0_11
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite the success of fully-supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning remains an active research area because acquiring task-specific skeleton annotations at large scales is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. In contrast to such supervision applied only at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS), which explicitly captures spatial, short-term, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model during pre-training transfers strongly across downstream tasks. The source code can be found at https://github.com/yuxiaochen1103/Hi-TRS.
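The three-level encoding described in the abstract (frame-level spatial attention over joints, clip-level short-term attention over frames, and video-level long-term attention over clips) can be illustrated with a toy sketch. The code below is an illustrative approximation, not the authors' implementation: `self_attention` stands in for a full Transformer layer (it omits learned projections, multiple heads, and feed-forward blocks), and the `clip_len` window size is an assumption made for the example.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention on tokens x: (n, d)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x  # (n, d)

def hierarchical_encode(skeleton, clip_len=4):
    """Toy frame -> clip -> video hierarchy for skeleton: (frames, joints, d)."""
    T, J, d = skeleton.shape
    # Frame level: attend across joints, mean-pool to one token per frame.
    frames = np.stack([self_attention(skeleton[t]).mean(axis=0)
                       for t in range(T)])              # (T, d)
    # Clip level: attend across frames inside each short window.
    clips = np.stack([self_attention(frames[s:s + clip_len]).mean(axis=0)
                      for s in range(0, T, clip_len)])  # (T // clip_len, d)
    # Video level: attend across clip tokens for a global representation.
    return self_attention(clips).mean(axis=0)           # (d,)

# Example: 8 frames, 25 joints (NTU-style skeleton), 16-dim joint features.
video_repr = hierarchical_encode(np.random.randn(8, 25, 16))
```

Each pooling step compresses one structural level before the next attends over it, which is the sense in which spatial, short-term, and long-term dependencies are modeled at separate stages rather than in one flat sequence.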
Pages: 185-202 (18 pages)