Cross-Modal Contrastive Pre-Training for Few-Shot Skeleton Action Recognition

Cited by: 0
Authors
Lu, Mingqi [1 ,2 ,3 ]
Yang, Siyuan [4 ]
Lu, Xiaobo [1 ,2 ]
Liu, Jun [3 ,5 ]
Affiliations
[1] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab Measurement & Control Complex Syst Engn, Minist Educ, Nanjing 210096, Peoples R China
[3] Singapore Univ Technol & Design, Informat Syst Technol & Design Pillar, Singapore 487372, Singapore
[4] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore
[5] Univ Lancaster, Sch Comp & Commun, Lancaster LA1 4YW, England
Funding
National Natural Science Foundation of China;
Keywords
Skeleton; Training; Feature extraction; Image recognition; Task analysis; Metalearning; Computational modeling; Few-shot skeleton action recognition; contrastive learning; knowledge distillation; SELF;
DOI
10.1109/TCSVT.2024.3402952
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification code
0808; 0809;
Abstract
This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. The pre-training and fine-tuning paradigm has been shown to handle few-shot tasks more effectively than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is difficult to obtain. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which treats skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. Using a simple regression loss, the framework transfers robust, high-quality vision-language representations to the skeleton encoder. The skeleton encoder thus gains a comprehensive understanding of action sequences and benefits from the prior knowledge embedded in a vision-language pre-trained model. This representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that our proposed approach achieves state-of-the-art performance for few-shot skeleton action recognition.
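To make the first stage concrete, below is a minimal PyTorch sketch of a BYOL-style cross-modal regression objective as the abstract describes it: the skeleton encoder (online branch) regresses, through a small projector head, the features that a frozen vision-language visual encoder extracts from the paired video (target branch). All names, dimensions, and the exact head architecture here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Small MLP head, as is standard in BYOL-style frameworks (assumed shape)."""
    def __init__(self, in_dim=256, hidden_dim=1024, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def regression_loss(online, target):
    """BYOL's simple regression loss: mean squared error between
    L2-normalised features, which equals 2 - 2 * cosine similarity."""
    online = F.normalize(online, dim=-1)
    target = F.normalize(target, dim=-1)
    return (2 - 2 * (online * target).sum(dim=-1)).mean()

def pretrain_step(skeleton_encoder, projector, vl_visual_encoder,
                  skeletons, videos, optimizer):
    """One cross-modal pre-training step (hypothetical setup)."""
    with torch.no_grad():
        target = vl_visual_encoder(videos)   # frozen vision-language features
    online = projector(skeleton_encoder(skeletons))
    loss = regression_loss(online, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

One design note, under the same assumptions: because the target branch is a frozen vision-language model rather than a momentum copy of the online network, BYOL's stop-gradient falls out of `torch.no_grad()` for free, and gradients update only the skeleton encoder and projector.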
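The second stage, fine-tuning a cosine classifier on the support set, could look like the sketch below. The scale (temperature) value, the optimizer settings, and keeping the pre-trained encoder frozen are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Logits are scaled cosine similarities between features
    and per-class weight vectors."""
    def __init__(self, feat_dim, n_way, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_way, feat_dim))
        self.scale = scale  # temperature on the cosine logits (assumed value)

    def forward(self, feats):
        feats = F.normalize(feats, dim=-1)
        weight = F.normalize(self.weight, dim=-1)
        return self.scale * feats @ weight.t()

def finetune_on_support(encoder, support_x, support_y, n_way, epochs=100):
    """Fit the cosine classifier on an N-way K-shot support set,
    with the pre-trained skeleton encoder kept frozen (an assumption)."""
    with torch.no_grad():
        feats = encoder(support_x)           # pre-extract support features
    clf = CosineClassifier(feats.shape[-1], n_way)
    opt = torch.optim.SGD(clf.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs):
        loss = F.cross_entropy(clf(feats), support_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return clf
```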
Pages: 9798-9807
Page count: 10