Masked cosine similarity prediction for self-supervised skeleton-based action representation learning

Cited by: 0
Authors
Ziliang Ren [1 ]
Ronggui Liu [2 ]
Yong Qin [1 ]
Xiangyang Gao [1 ]
Qieshi Zhang [2 ]
Affiliations
[1] Dongguan University of Technology,School of Computer Science and Technology
[2] Chinese Academy of Sciences,CAS Key Laboratory of Human
Keywords
Skeleton-based action recognition; Self-supervised learning; Masked autoencoders
DOI
10.1007/s10044-025-01472-3
Abstract
Skeleton-based human action recognition faces challenges owing to the limited availability of annotated data, which constrains the performance of supervised methods in learning representations of skeleton sequences. To address this issue, researchers have introduced self-supervised learning to reduce the reliance on annotated data. This approach exploits the intrinsic supervisory signals embedded within the data itself. In this study, we demonstrate that considering relative positional relationships between joints, rather than relying on joint coordinates as absolute positional information, yields more effective representations of skeleton sequences. Based on this, we introduce the Masked Cosine Similarity Prediction (MCSP) framework, which takes randomly masked skeleton sequences as input and predicts the corresponding cosine similarity between masked joints. Comprehensive experiments show that the proposed MCSP self-supervised pre-training method effectively learns representations of skeleton sequences, improving model performance while decreasing dependence on extensive labeled datasets. After pre-training with MCSP, a vanilla transformer architecture is employed for fine-tuning in action recognition. The results obtained from six subsets of the NTU-RGB+D 60, NTU-RGB+D 120 and PKU-MMD datasets show that our method achieves significant performance improvements on five subsets. Compared to training from scratch, performance improvements are 9.8%, 4.9%, 13.0%, 11.5%, and 3.6%, respectively, with top-1 accuracies of 92.9%, 97.3%, 89.8%, 91.2%, and 96.1% being achieved. Furthermore, our method achieves comparable results on the PKU-MMD Phase II dataset, achieving a top-1 accuracy of 51.5%. These results are competitive without the need for intricate designs, such as multi-stream model ensembles or extreme data augmentation. The source code of MCSP is available at https://github.com/skyisyourlimit/MCSP.
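The core idea of the pretext task — predicting pairwise cosine similarity between masked joints rather than reconstructing their absolute coordinates — can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine_similarity_targets`, `mcsp_loss`), the ~40% masking ratio, and the MSE objective over the masked pairs are illustrative assumptions; the paper's actual masking strategy and loss may differ.

```python
import numpy as np

def cosine_similarity_targets(joints, eps=1e-8):
    # joints: (J, 3) joint coordinates for one frame.
    # Normalize each joint vector, then take all pairwise dot products,
    # giving a (J, J) matrix of cosine similarities between joints.
    norms = np.linalg.norm(joints, axis=-1, keepdims=True)
    unit = joints / np.maximum(norms, eps)
    return unit @ unit.T

def mcsp_loss(pred_sim, joints, mask):
    # pred_sim: (J, J) model-predicted similarities (hypothetical decoder output).
    # mask: boolean (J,), True where a joint was masked out of the input.
    # The loss is computed only over pairs of masked joints.
    target = cosine_similarity_targets(joints)
    idx = np.where(mask)[0]
    diff = pred_sim[np.ix_(idx, idx)] - target[np.ix_(idx, idx)]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
joints = rng.normal(size=(25, 3))   # 25 joints, as in the NTU-RGB+D skeleton
mask = rng.random(25) < 0.4         # randomly mask ~40% of joints (assumed ratio)
perfect = cosine_similarity_targets(joints)
print(mcsp_loss(perfect, joints, mask))  # → 0.0 for a perfect prediction
```

Because the targets are similarities between joint direction vectors, they encode relative geometric relationships and are invariant to the overall scale of the skeleton, which matches the abstract's argument for relative over absolute positional information.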