Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

Cited by: 2
Authors
Sun, Shengkai [1 ]
Liu, Daizong [2 ]
Dong, Jianfeng [3 ]
Qu, Xiaoye [4 ]
Gao, Junyu [5 ]
Yang, Xun [6 ]
Wang, Xun [3 ]
Wang, Meng [7 ]
Affiliations
[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China
[2] Peking Univ, Beijing, Peoples R China
[3] Zhejiang Gongshang Univ, Zhejiang Key Lab E Commerce, Hangzhou, Peoples R China
[4] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[6] Univ Sci & Technol China, Hefei, Peoples R China
[7] Hefei Univ Technol, Hefei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal Learning; Unsupervised Representation Learning; Action Understanding;
DOI
10.1145/3581783.3612449
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Unsupervised pre-training has recently shown great success in skeleton-based action understanding. Existing works typically train separate modality-specific models (i.e., joint, bone, and motion) and then integrate the multi-modal information for action understanding via a late-fusion strategy. Although these approaches achieve strong performance, they suffer from complex yet redundant multi-stream model designs, each of which is also limited to a fixed input skeleton modality. To alleviate these issues, in this paper we propose a Unified Multi-modal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed the different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features, thereby reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., are not dominated by a particular input modality, we further propose both intra- and inter-modal consistency learning, which guarantees that the multi-modal features contain the complete semantics of each modality via feature decomposition and distinct alignment. In this manner, our framework learns unified representations of uni-modal or multi-modal skeleton input and is thus flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, with approximately the same complexity as uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning. Our source code is available at https://github.com/HuiGuanLab/UmURL.
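For intuition only, below is a minimal PyTorch sketch of the early-fusion, single-stream idea and a toy stand-in for the intra-/inter-modal consistency objective described in the abstract. It is not the authors' implementation (see the linked repository for the official code); the class, function, and parameter names (EarlyFusionEncoder, consistency_loss, in_dim, etc.) and the simple linear backbone are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusionEncoder(nn.Module):
    # Single-stream encoder that fuses joint, bone, and motion inputs early,
    # instead of running three separate modality-specific streams.
    def __init__(self, in_dim, hidden_dim=256, out_dim=128):
        super().__init__()
        # One lightweight projection per modality, then a shared backbone.
        self.proj = nn.ModuleDict({
            m: nn.Linear(in_dim, hidden_dim) for m in ("joint", "bone", "motion")
        })
        self.backbone = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (batch, in_dim) tensor;
        # any subset of the three modalities may be supplied.
        fused = torch.stack([self.proj[m](x) for m, x in inputs.items()]).sum(dim=0)
        return self.backbone(fused)  # unified representation

def consistency_loss(fused, unimodal_feats):
    # Toy stand-in for intra-/inter-modal consistency: pull the fused feature
    # toward each uni-modal feature so no single modality dominates it.
    fused = F.normalize(fused, dim=-1)
    loss = fused.new_zeros(())
    for feat in unimodal_feats.values():
        loss = loss + (1.0 - F.cosine_similarity(fused, F.normalize(feat, dim=-1)).mean())
    return loss / len(unimodal_feats)

# Usage sketch: a batch of 8 skeleton sequences, each modality flattened to 150-D.
encoder = EarlyFusionEncoder(in_dim=150)
batch = {m: torch.randn(8, 150) for m in ("joint", "bone", "motion")}
multi_modal_feat = encoder(batch)                                # early-fused feature
uni_modal_feats = {m: encoder({m: x}) for m, x in batch.items()}
loss = consistency_loss(multi_modal_feat, uni_modal_feats)

The point of the sketch is structural: a single shared backbone consumes any subset of modalities, and a consistency term keeps the fused feature faithful to every modality, which is what allows the same encoder to serve uni-modal and multi-modal inputs.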
Pages: 2973-2984
Page count: 12