Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

Cited by: 2
Authors
Sun, Shengkai [1 ]
Liu, Daizong [2 ]
Dong, Jianfeng [3 ]
Qu, Xiaoye [4 ]
Gao, Junyu [5 ]
Yang, Xun [6 ]
Wang, Xun [3 ]
Wang, Meng [7 ]
Affiliations
[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China
[2] Peking Univ, Beijing, Peoples R China
[3] Zhejiang Gongshang Univ, Zhejiang Key Lab E Commerce, Hangzhou, Peoples R China
[4] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[6] Univ Sci & Technol China, Hefei, Peoples R China
[7] Hefei Univ Technol, Hefei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal Learning; Unsupervised Representation Learning; Action Understanding;
DOI
10.1145/3581783.3612449
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Unsupervised pre-training has recently shown great success in skeleton-based action understanding. Existing works typically train separate modality-specific models (i.e., joint, bone, and motion) and then integrate the multi-modal information for action understanding via a late-fusion strategy. Although these approaches achieve significant performance, they suffer from complex yet redundant multi-stream model designs, each of which is also limited to a fixed input skeleton modality. To alleviate these issues, in this paper we propose a Unified Multi-modal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed the different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features, thereby reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., are not dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modality via feature decomposition and distinct alignment. In this manner, our framework is able to learn unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, with approximately the same complexity as uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning. Our source code is available at https://github.com/HuiGuanLab/UmURL.
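The abstract outlines the core mechanism: joint, bone, and motion inputs are embedded and fused early into one shared encoder, and intra-/inter-modal consistency objectives keep the fused representation from being dominated by any single modality. The minimal PyTorch sketch below illustrates that idea only; the module names, the Transformer backbone, average-based fusion, cosine-similarity consistency terms, and all dimensions are assumptions made for illustration, not the authors' implementation (see the linked repository for the actual code).

    # Illustrative sketch of early-fusion, single-stream encoding with
    # intra-/inter-modal consistency terms. All design choices are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EarlyFusionEncoder(nn.Module):
        def __init__(self, in_dim=150, d_model=256, n_layers=4, n_heads=4):
            super().__init__()
            # one lightweight embedding per skeleton modality
            self.embed = nn.ModuleDict({
                m: nn.Linear(in_dim, d_model) for m in ("joint", "bone", "motion")
            })
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)  # shared single stream

        def forward(self, inputs):
            # inputs: dict of available modalities, each (batch, frames, in_dim)
            tokens = [self.embed[m](x) for m, x in inputs.items()]
            fused = torch.stack(tokens, dim=0).mean(dim=0)  # early fusion by averaging
            feats = self.backbone(fused)                    # single-stream encoding
            return feats.mean(dim=1)                        # temporal pooling -> (batch, d_model)

    def consistency_losses(model, inputs):
        # multi-modal feature from the fused input
        z_multi = model(inputs)
        # uni-modal features from feeding each modality alone through the same stream
        z_uni = {m: model({m: x}) for m, x in inputs.items()}
        # intra-modal term: the fused feature should stay close to every uni-modal feature
        intra = sum(1 - F.cosine_similarity(z_multi, z).mean() for z in z_uni.values())
        # inter-modal term: uni-modal features of the same sequence should agree
        zs = list(z_uni.values())
        inter = sum(1 - F.cosine_similarity(zs[i], zs[j]).mean()
                    for i in range(len(zs)) for j in range(i + 1, len(zs)))
        return intra + inter

    if __name__ == "__main__":
        model = EarlyFusionEncoder()
        batch = {m: torch.randn(2, 64, 150) for m in ("joint", "bone", "motion")}
        loss = consistency_losses(model, batch)
        loss.backward()
        print(loss.item())

Because the same stream accepts a single-modality dictionary or the full set of modalities, one model serves both uni-modal and multi-modal skeleton input, which mirrors the flexibility claim in the abstract.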
Pages: 2973-2984
Page count: 12