Contrastive 3D Human Skeleton Action Representation Learning via CrossMoCo With Spatiotemporal Occlusion Mask Data Augmentation

Cited by: 1
Authors
Zeng, Qinyang [1]
Liu, Chengju [1, 2]
Liu, Ming [3]
Chen, Qijun [1]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Tongji Res Inst Artificial Intelligence Suzhou, Suzhou 215300, Peoples R China
[3] Hong Kong Univ Sci & Technol, Dept Elect & Comp Engn, Hong Kong 999077, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Skeleton; Feature extraction; Spatiotemporal phenomena; Three-dimensional displays; Data mining; Joints; Learning systems; Cross contrastive learning; spatiotemporal occlusion mask; human skeleton action recognition;
DOI
10.1109/TMM.2023.3253048
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Self-supervised contrastive learning methods for 3D skeleton-based action recognition have achieved results competitive with classical supervised methods. Recent research shows that adding a multilayer perceptron (MLP) on top of the base encoder helps extract high-level, global positive representations, and that using a negative memory bank to store negative samples dynamically balances storage capacity against feature consistency. However, these methods still need to account for two issues: the MLP lacks accurate encoding of fine-grained local features, and a memory bank needs rich and diverse negative samples to match positive representations from different encoders. This paper proposes a new method called Cross Momentum Contrast (CrossMoCo), composed of three parts: an ST-GCN encoder, an ST-GCN encoder followed by an MLP (the ST-MLP encoder), and two independent negative memory banks. The two encoders encode the input data into two pairs of positive features. Learning the cross representations of the two positive pairs helps the model extract both global and local information. The two independent negative memory banks update their negative samples according to the different positive representations from the two encoders, which diversifies the distribution of negative samples and brings negative representations closer to the positive features. The resulting increase in classification difficulty improves the model's contrastive learning ability. In addition, a spatiotemporal occlusion mask data augmentation method is used to enhance the information diversity of positive samples. This method takes adjacent skeleton joints that form a skeleton bone as the mask unit, which reduces information redundancy after data augmentation, since adjacent joints may carry similar spatiotemporal information. Experiments on the PKU-MMD Part II, NTU RGB+D 60, and NW-UCLA datasets show that the CrossMoCo framework with spatiotemporal occlusion mask data augmentation achieves comparable performance.
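The bone-level mask unit described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spatiotemporal_bone_mask`, the toy 5-joint bone list `TOY_BONES`, the choice of zero-filling, and the use of one contiguous frame window are all assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy skeleton: 5 joints in a chain, each bone joins two adjacent joints.
TOY_BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def spatiotemporal_bone_mask(seq, bones, num_bones=1, num_frames=2, rng=None):
    """Occlude whole bones (pairs of adjacent joints) over a window of frames.

    seq: array of shape (T, J, C) = (frames, joints, coordinate channels).
    Returns the masked copy plus the (start_frame, masked_joints) that were chosen.
    Zero-filling and a single contiguous window are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_total_frames = seq.shape[0]
    out = seq.copy()

    # Temporal part: pick the first frame of a contiguous occlusion window.
    start = int(rng.integers(0, num_total_frames - num_frames + 1))

    # Spatial part: pick bones, then mask both joints of every chosen bone together.
    chosen = rng.choice(len(bones), size=num_bones, replace=False)
    joints = sorted({j for b in chosen for j in bones[b]})

    out[start:start + num_frames, joints, :] = 0.0
    return out, (start, joints)


if __name__ == "__main__":
    sequence = np.ones((6, 5, 3))  # 6 frames, 5 joints, xyz coordinates
    masked, (start, joints) = spatiotemporal_bone_mask(
        sequence, TOY_BONES, rng=np.random.default_rng(0)
    )
    print("window start:", start, "masked joints:", joints)
```

Masking both joints of a bone at once, rather than sampling joints independently, reflects the abstract's point that adjacent joints may carry similar information, so hiding only one of them would leave most of that information recoverable from its neighbor.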
Pages: 1564-1574
Number of pages: 11