Contrastive 3D Human Skeleton Action Representation Learning via CrossMoCo With Spatiotemporal Occlusion Mask Data Augmentation

Cited by: 1
Authors
Zeng, Qinyang [1]
Liu, Chengju [1,2]
Liu, Ming [3 ]
Chen, Qijun [1]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Tongji Res Inst Artificial Intelligence Suzhou, Suzhou 215300, Peoples R China
[3] Hong Kong Univ Sci & Technol, Dept Elect & Comp Engn, Hong Kong 999077, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Skeleton; Feature extraction; Spatiotemporal phenomena; Three-dimensional displays; Data mining; Joints; Learning systems; Cross contrastive learning; spatiotemporal occlusion mask; human skeleton action recognition;
DOI
10.1109/TMM.2023.3253048
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Self-supervised methods for 3D skeleton-based action recognition via contrastive learning have achieved results competitive with classical supervised methods. Recent research shows that adding a multilayer perceptron (MLP) on top of the base encoder can extract high-level, global positive representations, and that using a negative memory bank to store negative samples dynamically can balance storage capacity against feature consistency. However, such methods must contend with two limitations: the MLP lacks accurate encoding of fine-grained local features, and a memory bank needs rich and diverse negative samples to match positive representations from different encoders. This paper proposes a new method called Cross Momentum Contrast (CrossMoCo), composed of three parts: an ST-GCN encoder, an ST-GCN encoder followed by an MLP (ST-MLP encoder), and two independent negative memory banks. The two encoders encode the input data into two positive feature pairs. Learning the cross representations of the two positive pairs helps the model extract both global and local information. The two independent negative memory banks update their negative samples according to the different positive representations from the two encoders, which diversifies the distribution of negative samples and brings negative representations close to the positive features. The resulting increase in classification difficulty improves the model's contrastive learning ability. In addition, a spatiotemporal occlusion mask data augmentation method is used to enhance the information diversity of positive samples. This method takes adjacent skeleton joints that form a skeleton bone as a mask unit, which reduces information redundancy after data augmentation, since adjacent joints may carry similar spatiotemporal information. Experiments on the PKU-MMD Part II, NTU RGB+D 60, and NW-UCLA datasets show that the CrossMoCo framework with spatiotemporal occlusion mask data augmentation achieves comparable performance.
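The bone-level masking idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `bone_occlusion_mask`, the toy bone list `TOY_BONES`, the choice to zero out masked coordinates, and the use of a single contiguous frame window are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 5-joint chain skeleton; each bone is a pair of adjacent joints.
# A real dataset (e.g. NTU RGB+D 60) would supply its own bone list.
TOY_BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def bone_occlusion_mask(seq, bones, num_bones, num_frames, rng):
    """Occlude randomly chosen bones over one random contiguous frame window.

    seq has shape (T, V, C): frames, joints, coordinates.
    Masked entries are set to zero (an assumption, not taken from the paper).
    Returns the augmented copy, the masked joint indices, and the frame window.
    """
    T, V, C = seq.shape
    out = seq.copy()  # leave the input sequence untouched

    # Temporal part: pick the start of a window of num_frames frames.
    start = int(rng.integers(0, T - num_frames + 1))
    end = start + num_frames

    # Spatial part: pick whole bones, so both end joints are masked together.
    chosen = rng.choice(len(bones), size=num_bones, replace=False)
    joints = sorted({j for b in chosen for j in bones[b]})

    out[start:end, joints, :] = 0.0
    return out, joints, (start, end)


# Usage: mask 2 bones for 4 frames of a 10-frame, 5-joint, 3D sequence.
rng = np.random.default_rng(0)
sequence = np.ones((10, 5, 3))
augmented, masked_joints, window = bone_occlusion_mask(
    sequence, TOY_BONES, num_bones=2, num_frames=4, rng=rng
)
```

Masking a bone rather than a single joint removes both of its end joints at once, which matches the abstract's rationale that adjacent joints may carry similar spatiotemporal information.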
Pages: 1564 - 1574
Number of pages: 11
Related papers
50 in total
  • [1] Skeleton-Contrastive 3D Action Representation Learning
    Thoker, Fida Mohammad
    Doughty, Hazel
    Snoek, Cees G. M.
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1655 - 1663
  • [2] Enhancing Human Action Recognition with 3D Skeleton Data: A Comprehensive Study of Deep Learning and Data Augmentation
    Xin, Chu
    Kim, Seokhwan
    Cho, Yongjoo
    Park, Kyoung Shin
    ELECTRONICS, 2024, 13 (04)
  • [3] Contrastive Positive Mining for Unsupervised 3D Action Representation Learning
    Zhang, Haoyuan
    Hou, Yonghong
    Zhang, Wenjing
    Li, Wanqing
    COMPUTER VISION - ECCV 2022, PT IV, 2022, 13664 : 36 - 51
  • [4] Skeleton Cloud Colorization for Unsupervised 3D Action Representation Learning
    Yang, Siyuan
    Liu, Jun
    Lu, Shijian
    Er, Meng Hwa
    Kot, Alex C.
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 13403 - 13413
  • [5] Mutual Information Driven Equivariant Contrastive Learning for 3D Action Representation Learning
    Lin, Lilang
    Zhang, Jiahang
    Liu, Jiaying
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1883 - 1897
  • [6] Human Action Recognition Based on Quaternion 3D Skeleton Representation
    Xu Haiyang
    Kong Jun
    Jiang Min
    LASER & OPTOELECTRONICS PROGRESS, 2018, 55 (02)
  • [7] Adaptive Spatiotemporal Representation Learning for Skeleton-Based Human Action Recognition
    Yu, Jiahui
    Gao, Hongwei
    Chen, Yongquan
    Zhou, Dalin
    Liu, Jinguo
    Ju, Zhaojie
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2022, 14 (04) : 1654 - 1665
  • [8] EnsCLR: Unsupervised skeleton-based action recognition via ensemble contrastive learning of representation
    Wang, Kun
    Cao, Jiuxin
    Cao, Biwei
    Liu, Bo
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 247
  • [9] Self-Supervised 3D Action Representation Learning With Skeleton Cloud Colorization
    Yang, Siyuan
    Liu, Jun
    Lu, Shijian
    Hwa, Er Meng
    Hu, Yongjian
    Kot, Alex C.
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (01) : 509 - 524
  • [10] Modeling the Uncertainty for Self-supervised 3D Skeleton Action Representation Learning
    Su, Yukun
    Lin, Guosheng
    Sun, Ruizhou
    Hao, Yun
    Wu, Qingyao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 769 - 778