Cross-Modal Contrastive Pre-Training for Few-Shot Skeleton Action Recognition

Cited by: 0
Authors
Lu, Mingqi [1 ,2 ,3 ]
Yang, Siyuan [4 ]
Lu, Xiaobo [1 ,2 ]
Liu, Jun [3 ,5 ]
Affiliations
[1] Southeast University, School of Automation, Nanjing 210096, People's Republic of China
[2] Southeast University, Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Nanjing 210096, People's Republic of China
[3] Singapore University of Technology and Design, Information Systems Technology and Design Pillar, Singapore 487372, Singapore
[4] Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore 639798, Singapore
[5] Lancaster University, School of Computing and Communications, Lancaster LA1 4YW, England
Funding
National Natural Science Foundation of China
Keywords
Skeleton; Training; Feature extraction; Image recognition; Task analysis; Metalearning; Computational modeling; Few-shot skeleton action recognition; contrastive learning; knowledge distillation; SELF;
DOI
10.1109/TCSVT.2024.3402952
CLC Classification Number
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Code
0808 ; 0809 ;
Abstract
This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. This pre-train-then-fine-tune paradigm has been shown to handle few-shot tasks more effectively than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is difficult to obtain for skeleton data. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which treats skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. Using a simple regression loss, the framework transfers robust, high-quality vision-language representations to the skeleton encoder. This allows the skeleton encoder to gain a comprehensive understanding of action sequences and to benefit from the prior knowledge of a vision-language pre-trained model. The representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that the proposed approach achieves state-of-the-art performance for few-shot skeleton action recognition.
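Since this record contains only the abstract, the following is a minimal PyTorch sketch of the two stages it describes: a BYOL-style regression loss that aligns skeleton embeddings with frozen vision-language embeddings of the paired videos, then a cosine classifier fine-tuned on the support set. All module names, dimensions, and hyperparameters here (e.g. `CrossModalBYOLLoss`, the predictor widths, the scale factor) are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of the abstract's two stages; names and dimensions are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalBYOLLoss(nn.Module):
    """Stage 1: BYOL-style regression loss that pulls the skeleton (online)
    embedding toward the frozen vision-language (target) embedding of the
    paired video, treating the two modalities as augmented views."""
    def __init__(self, skel_dim: int = 256, target_dim: int = 512):
        super().__init__()
        # Predictor head on the online branch, as in BYOL.
        self.predictor = nn.Sequential(
            nn.Linear(skel_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, target_dim),
        )

    def forward(self, skel_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        p = F.normalize(self.predictor(skel_feat), dim=-1)
        z = F.normalize(video_feat.detach(), dim=-1)  # no gradient to the target branch
        # 2 - 2*cos(p, z) equals the MSE between unit vectors (BYOL regression loss).
        return (2.0 - 2.0 * (p * z).sum(dim=-1)).mean()

class CosineClassifier(nn.Module):
    """Stage 2: cosine classifier fine-tuned on the few-shot support set."""
    def __init__(self, feat_dim: int, n_way: int, scale: float = 10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_way, feat_dim) * 0.01)
        self.scale = scale

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        feat = F.normalize(feat, dim=-1)
        weight = F.normalize(self.weight, dim=-1)
        return self.scale * feat @ weight.t()  # scaled cosine-similarity logits

# Usage sketch: skel_feat would come from the skeleton encoder being
# pre-trained; video_feat from a frozen vision-language model on the video.
loss_fn = CrossModalBYOLLoss()
skel_feat = torch.randn(8, 256)   # batch of skeleton embeddings (assumed dim)
video_feat = torch.randn(8, 512)  # batch of vision-language embeddings (assumed dim)
loss_fn(skel_feat, video_feat).backward()
```

Note the `detach()` on the target branch: as in BYOL, gradients flow only through the online (skeleton) predictor, so the vision-language representation acts as a fixed teacher rather than being updated.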
Pages: 9798-9807
Page count: 10
Related Papers
50 in total
  • [21] Cross-modal de-deviation for enhancing few-shot classification
    Pan, Mei-Hong
    Shen, Hong-Bin
    PATTERN RECOGNITION, 2024, 152
  • [22] Few-Shot Cross-Lingual Stance Detection with Sentiment-Based Pre-training
    Hardalov, Momchil
    Arora, Arnav
    Nakov, Preslav
    Augenstein, Isabelle
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 10729 - 10737
  • [23] Label Semantic Aware Pre-training for Few-shot Text Classification
    Mueller, Aaron
    Krone, Jason
    Romeo, Salvatore
    Mansour, Saab
    Mansimov, Elman
    Zhang, Yi
    Roth, Dan
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 8318 - 8334
  • [24] Multitask Pre-training of Modular Prompt for Chinese Few-Shot Learning
    Sun, Tianxiang
    He, Zhengfu
    Zhu, Qin
    Qiu, Xipeng
    Huang, Xuanjing
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 11156 - 11172
  • [26] Img2Acoustic: A Cross-Modal Gesture Recognition Method Based on Few-Shot Learning
    Zou, Yongpan
    Weng, Jianhao
    Kuang, Wenting
    Jiao, Yang
    Leung, Victor C. M.
    Wu, Kaishun
    IEEE TRANSACTIONS ON MOBILE COMPUTING, 2025, 24 (03) : 1496 - 1512
  • [27] DCMA-Net: dual cross-modal attention for fine-grained few-shot recognition
    Zhou, Yan
    Ren, Xiao
    Li, Jianxun
    Yang, Yin
    Zhou, Haibin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (05) : 14521 - 14537
  • [28] CTR: Contrastive Training Recognition Classifier for Few-Shot Open-World Recognition
    Dionelis, Nikolaos
    Tsaftaris, Sotirios A.
    Yaghoobi, Mehrdad
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1792 - 1799
  • [29] UniXcoder: Unified Cross-Modal Pre-training for Code Representation
    Guo, Daya
    Lu, Shuai
    Duan, Nan
    Wang, Yanlin
    Zhou, Ming
    Yin, Jian
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7212 - 7225
  • [30] Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning
    Yu, Yang
    Zhang, Dong
    Li, Shoushan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022