Cross-Modal Contrastive Pre-Training for Few-Shot Skeleton Action Recognition

Cited by: 0
Authors
Lu, Mingqi [1,2,3]
Yang, Siyuan [4]
Lu, Xiaobo [1,2]
Liu, Jun [3,5]
Affiliations
[1] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab Measurement & Control Complex Syst Engn, Minist Educ, Nanjing 210096, Peoples R China
[3] Singapore Univ Technol & Design, Informat Syst Technol & Design Pillar, Singapore 487372, Singapore
[4] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore
[5] Univ Lancaster, Sch Comp & Commun, Lancaster LA1 4YW, England
Funding
National Natural Science Foundation of China
Keywords
Skeleton; Training; Feature extraction; Image recognition; Task analysis; Metalearning; Computational modeling; Few-shot skeleton action recognition; contrastive learning; knowledge distillation; SELF;
DOI
10.1109/TCSVT.2024.3402952
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Codes
0808; 0809
Abstract
This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. The pre-training-and-fine-tuning paradigm has been shown to handle few-shot tasks more effectively than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is nevertheless difficult to obtain. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which treats skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. Using a simple regression loss, the framework transfers robust, high-quality vision-language representations to the skeleton encoder. The skeleton encoder thereby gains a comprehensive understanding of action sequences and benefits from the prior knowledge embedded in a vision-language pre-trained model. This representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that our approach achieves state-of-the-art performance for few-shot skeleton action recognition.
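To make the two stages concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation: the module names (skeleton_encoder, predictor, vl_video_encoder), the logit scale, and the training-step structure are illustrative assumptions. It shows (i) a BYOL-style regression loss that aligns online skeleton features with frozen vision-language video features, and (ii) a cosine classifier of the kind fine-tuned on the few-shot support set.

import torch
import torch.nn.functional as F

def byol_regression_loss(pred, target):
    # BYOL-style regression loss: MSE between l2-normalized vectors,
    # which equals 2 - 2 * cosine similarity.
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    return (2.0 - 2.0 * (pred * target).sum(dim=-1)).mean()

def pretrain_step(skeleton_seq, video, skeleton_encoder, predictor,
                  vl_video_encoder, optimizer):
    # Stage 1 (sketch): the skeleton sequence and its corresponding video
    # are treated as two views of the same action; the frozen
    # vision-language video encoder provides the regression target.
    with torch.no_grad():
        target = vl_video_encoder(video)
    pred = predictor(skeleton_encoder(skeleton_seq))
    loss = byol_regression_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

class CosineClassifier(torch.nn.Module):
    # Stage 2 (sketch): logits are scaled cosine similarities between
    # features and per-class weight vectors; this head is fine-tuned
    # on the support set. The scale value 10.0 is an assumed default.
    def __init__(self, feat_dim, num_classes, scale=10.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, feats):
        feats = F.normalize(feats, dim=-1)
        weight = F.normalize(self.weight, dim=-1)
        return self.scale * feats @ weight.t()

Normalizing both the features and the class weights makes the logits depend only on direction, not magnitude, which is why cosine classifiers are a common choice for fine-tuning on small support sets.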
Pages: 9798-9807
Page count: 10
Related Papers
50 in total (entries [31]-[40] shown)
• [31] Wang, Runqi; Zheng, Hao; Duan, Xiaoyue; Liu, Jianzhuang; Lu, Yuning; Wang, Tian; Xu, Songcen; Zhang, Baochang. Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 23445-23454.
• [32] Zhang, Jing; Liu, Xiaoqiang; Chen, Mingzhe; Wang, Zhe. Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38, No. 7, 2024: 7151-7159.
• [33] Huang, Yan; Wang, Jingdong; Wang, Liang. Few-Shot Image and Sentence Matching via Aligned Cross-Modal Memory. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(6): 2968-2983.
• [34] Lin, Zhiqiu; Yu, Samuel; Kuang, Zhiyi; Pathak, Deepak; Ramanan, Deva. Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 19325-19337.
• [35] Shu, Yang; Cao, Zhangjie; Gao, Jinghan; Wang, Jianmin; Yu, Philip S.; Long, Mingsheng. Omni-Training: Bridging Pre-Training and Meta-Training for Few-Shot Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(12): 15275-15291.
• [36] Huang, Yan; Wang, Liang. ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching. 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 5773-5782.
• [37] Ma, Ning; Zhang, Hongyi; Li, Xuhui; Zhou, Sheng; Zhang, Zhen; Wen, Jun; Li, Haifeng; Gu, Jingjun; Bu, Jiajun. Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition. Computer Vision - ECCV 2022, Pt. IV, 2022, 13664: 174-191.
• [38] Luo, Jianjie; Li, Yehao; Pan, Yingwei; Yao, Ting; Chao, Hongyang; Mei, Tao. CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 5600-5608.
• [39] Wang, Yongxiong; Hu, Chuanfei; Wang, Guangpeng; Lin, Xu. Contrastive Representation for Few-Shot Vehicle Footprint Recognition. 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2021.
• [40] Wang, Xuan; Liu, Tong; Feng, Chao; Fang, Dingyi; Chen, Xiaojiang. RF-CM: Cross-Modal Framework for RF-enabled Few-Shot Human Activity Recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 2023, 7(1).