Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

Cited by: 2
Authors
Sun, Shengkai [1 ]
Liu, Daizong [2 ]
Dong, Jianfeng [3 ]
Qu, Xiaoye [4 ]
Gao, Junyu [5 ]
Yang, Xun [6 ]
Wang, Xun [3 ]
Wang, Meng [7 ]
Affiliations
[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China
[2] Peking Univ, Beijing, Peoples R China
[3] Zhejiang Gongshang Univ, Zhejiang Key Lab E Commerce, Hangzhou, Peoples R China
[4] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
[6] Univ Sci & Technol China, Hefei, Peoples R China
[7] Hefei Univ Technol, Hefei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal Learning; Unsupervised Representation Learning; Action Understanding;
DOI
10.1145/3581783.3612449
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Unsupervised pre-training has recently shown great success in skeleton-based action understanding. Existing works typically train separate modality-specific models (i.e., joint, bone, and motion), then integrate the multi-modal information for action understanding by a late-fusion strategy. Although these approaches have achieved significant performance, they suffer from complex yet redundant multi-stream model designs, each of which is also limited to a fixed input skeleton modality. To alleviate these issues, in this paper, we propose a Unified Multi-modal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features, reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., being dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modality via feature decomposition and distinct alignment. In this manner, our framework is able to learn unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, possessing approximately the same complexity as uni-modal methods while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning. Our source code is available at https://github.com/HuiGuanLab/UmURL.
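The abstract describes the core mechanism: joint, bone, and motion inputs are fused early and encoded by a single stream, and intra-/inter-modal consistency terms keep the fused feature from being dominated by any one modality. The PyTorch sketch below is a minimal, illustrative rendering of that idea under our own assumptions (a transformer backbone, additive fusion, cosine-similarity consistency terms, and stand-in uni-modal targets); it is not the authors' released implementation, which is available at the GitHub link above.

```python
# Minimal sketch of the early-fusion + consistency idea described in the abstract.
# Module names, dimensions, and loss forms are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyFusionEncoder(nn.Module):
    """Single-stream encoder that jointly embeds joint, bone, and motion inputs."""

    def __init__(self, in_dim=150, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # One lightweight projection per modality, then a shared backbone.
        self.proj = nn.ModuleDict({
            m: nn.Linear(in_dim, d_model) for m in ("joint", "bone", "motion")
        })
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Heads that decompose the fused feature back into per-modality parts.
        self.decompose = nn.ModuleDict({
            m: nn.Linear(d_model, d_model) for m in ("joint", "bone", "motion")
        })

    def forward(self, inputs):
        # inputs: dict of (B, T, in_dim) tensors; early fusion = sum of embeddings.
        fused = sum(self.proj[m](x) for m, x in inputs.items())
        h = self.backbone(fused).mean(dim=1)                # (B, d_model) fused feature
        parts = {m: self.decompose[m](h) for m in inputs}   # per-modality components
        return h, parts


def consistency_losses(parts, unimodal_feats):
    """Intra-modal: each decomposed part should match its uni-modal feature.
    Inter-modal: parts of the same sample should agree across modalities."""
    intra = sum(1 - F.cosine_similarity(parts[m], unimodal_feats[m]).mean()
                for m in parts) / len(parts)
    mods = list(parts)
    inter = sum(1 - F.cosine_similarity(parts[a], parts[b]).mean()
                for i, a in enumerate(mods) for b in mods[i + 1:])
    return intra, inter


if __name__ == "__main__":
    B, T, in_dim = 8, 64, 150
    inputs = {m: torch.randn(B, T, in_dim) for m in ("joint", "bone", "motion")}
    model = EarlyFusionEncoder(in_dim)
    fused, parts = model(inputs)
    # Stand-ins for uni-modal targets (in practice these would come from
    # uni-modal encodings of the same samples, per the paper's framework).
    targets = {m: torch.randn_like(p) for m, p in parts.items()}
    intra, inter = consistency_losses(parts, targets)
    print(fused.shape, intra.item(), inter.item())
```

In a real training loop the uni-modal targets would be produced by the framework itself (e.g., by passing a single modality through the same stream) rather than random tensors; those details should be taken from the official repository.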
Pages: 2973 - 2984
Number of pages: 12
Related Papers
50 records in total
  • [21] Fast unsupervised multi-modal hashing based on piecewise learning
    Li, Yinan
    Long, Jun
    Tu, Zerong
    Yang, Zhan
    KNOWLEDGE-BASED SYSTEMS, 2024, 299
  • [22] Multi-modal Relation Distillation for Unified 3D Representation Learning
    Wang, Huiqun
    Bao, Yiping
    Pan, Panwang
    Li, Zeming
    Liu, Xiao
    Yang, Ruijie
    Huang, Di
    COMPUTER VISION - ECCV 2024, PT XXXIII, 2025, 15091 : 364 - 381
  • [23] Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning
    Jiang, Qian
    Chen, Changyou
    Zhao, Han
    Chen, Liqun
    Ping, Qing
    Tran, Son Dinh
    Xu, Yi
    Zeng, Belinda
    Chilimbi, Trishul
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 7661 - 7671
  • [24] Multi-modal Network Representation Learning
    Zhang, Chuxu
    Jiang, Meng
    Zhang, Xiangliang
    Ye, Yanfang
    Chawla, Nitesh V.
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 3557 - 3558
  • [25] Global-local contrastive multiview representation learning for skeleton-based action
    Bian, Cunling
    Feng, Wei
    Meng, Fanbo
    Wang, Song
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 229
  • [26] Balanced Representation Learning for Long-tailed Skeleton-based Action Recognition
    Liu, Hongda
    Wang, Yunlong
    Ren, Min
    Hu, Junxing
    Luo, Zhengquan
    Hou, Guangqi
    Sun, Zhenan
    MACHINE INTELLIGENCE RESEARCH, 2025
  • [28] Unsupervised multi-modal representation learning for affective computing with multi-corpus wearable data
    Ross, Kyle
    Hungler, Paul
    Etemad, Ali
    JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING, 2021, 14 (4) : 3199 - 3224
  • [29] Reconstruction-driven contrastive learning for unsupervised skeleton-based human action recognition
    Liu, Xing
    Gao, Bo
    JOURNAL OF SUPERCOMPUTING, 2025, 81 (01)
  • [30] Robust Multi-Feature Learning for Skeleton-Based Action Recognition
    Wang, Yingfu
    Xu, Zheyuan
    Li, Li
    Yao, Jian
    IEEE ACCESS, 2019, 7 : 148658 - 148671