SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

Cited by: 0
Authors
Zhuang, Xuyi [1 ]
Qian, Yukun [1 ]
Wang, Mingjiang [1 ]
Affiliations
[1] Harbin Inst Technol, Lab Key Technol IoT Terminals, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Self-supervised learning; MAE; Vector quantization; Constrained computational resources; REPRESENTATION;
DOI
10.1186/s13636-024-00375-1
CLC number
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
Self-supervised learning (SSL) for speech pre-training has achieved remarkable success in learning superior speech contextual representations from unlabeled audio, excelling in numerous downstream speech tasks. However, pre-training these models demands substantial computational resources and training time, creating a high barrier to entry for pre-training research. By combining the resource efficiency of the generative Masked Auto Encoder (MAE) with the effectiveness of vector quantization from discriminative learning, we introduce a novel pre-training framework: Speech Vector Quantization Masked Auto Encoder (SVQ-MAE). Unlike most SSL frameworks, which construct speech contextual representations and perform mask reconstruction within a single encoder-only module, SVQ-MAE is pre-trained with a dedicated decoupled decoder. This decoupled decoder handles the mask reconstruction task alone, reducing the learning complexity of the pretext tasks and letting the encoder focus on extracting speech contextual representations. Owing to this design, SVQ-MAE trained on only 4 GPUs achieves performance comparable to wav2vec 2.0, which requires 64 GPUs for training. On the Speech Processing Universal Performance Benchmark (SUPERB), SVQ-MAE surpasses wav2vec 2.0 in both keyword spotting and emotion recognition. Furthermore, in cross-lingual ASR for Mandarin, after fine-tuning on AISHELL-1, SVQ-MAE achieves a character error rate of 4.09%, outperforming all supervised ASR models.
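To make the two ingredients the abstract names concrete, here is a toy sketch in plain Python of (a) vector quantization, which maps continuous speech frames to discrete codebook indices that can serve as prediction targets, and (b) MAE-style frame masking, where only the unmasked frames are fed to the encoder and a decoder reconstructs the rest. This is an illustrative simplification under assumed toy data, not the paper's actual SVQ-MAE implementation; all names (`quantize`, `mask_frames`) and the 2-D frames/codebook are hypothetical.

```python
import math
import random

def quantize(frame, codebook):
    """Return the index of the nearest codebook entry (Euclidean distance):
    the core operation of vector quantization."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))

def mask_frames(frames, mask_ratio=0.5, seed=0):
    """Randomly split frame indices into visible and masked sets, as in
    MAE-style pre-training: the encoder sees only the visible frames,
    while a (decoupled) decoder reconstructs the masked ones."""
    rng = random.Random(seed)
    idx = list(range(len(frames)))
    rng.shuffle(idx)
    n_mask = int(len(frames) * mask_ratio)
    return sorted(idx[n_mask:]), sorted(idx[:n_mask])  # visible, masked

# Toy data: four 2-D "speech frames" and a 2-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, 0.2], [0.9, 1.1], [0.0, 0.1], [1.2, 0.8]]

targets = [quantize(f, codebook) for f in frames]  # discrete targets: [0, 1, 0, 1]
visible, masked = mask_frames(frames, mask_ratio=0.5)
```

In the full framework, a learned codebook would replace the fixed one here, and the reconstruction loss over the masked positions is computed only by the decoupled decoder, so the encoder's capacity is spent on representation learning rather than on the pretext task.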
Pages: 16
Related papers
50 records total
  • [31] UNSUPERVISED PRE-TRAINING OF BIDIRECTIONAL SPEECH ENCODERS VIA MASKED RECONSTRUCTION
    Wang, Weiran
    Tang, Qingming
    Livescu, Karen
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6889 - 6893
  • [32] Reducing Domain mismatch in Self-supervised speech pre-training
    Baskar, Murali Karthick
    Rosenberg, Andrew
    Ramabhadran, Bhuvana
    Zhang, Yu
    INTERSPEECH 2022, 2022, : 3028 - 3032
  • [33] SPEECH ENHANCEMENT WITH MIXTURE OF DEEP EXPERTS WITH CLEAN CLUSTERING PRE-TRAINING
    Chazan, Shlomo E.
    Goldberger, Jacob
    Gannot, Sharon
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 716 - 720
  • [34] wav2vec: Unsupervised Pre-training for Speech Recognition
    Schneider, Steffen
    Baevski, Alexei
    Collobert, Ronan
    Auli, Michael
    INTERSPEECH 2019, 2019, : 3465 - 3469
  • [35] TWO-STAGE PRE-TRAINING FOR SEQUENCE TO SEQUENCE SPEECH RECOGNITION
    Fan, Zhiyun
    Zhou, Shiyu
    Xu, Bo
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [36] Efficient Pre-training for Localized Instruction Generation of Procedural Videos
    Batra, Anil
    Moltisanti, Davide
    Sevilla-Lara, Laura
    Rohrbach, Marcus
    Keller, Frank
    COMPUTER VISION - ECCV 2024, PT XXXIX, 2025, 15097 : 347 - 363
  • [37] CyclicFL: Efficient Federated Learning with Cyclic Model Pre-Training
    Zhang, Pengyu
    Zhou, Yingbo
    Hu, Ming
    Wei, Xian
    Chen, Mingsong
    JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2025,
  • [38] Efficient Image Pre-training with Siamese Cropped Masked Autoencoders
    Eymael, Alexandre
    Vandeghen, Renaud
    Cioppa, Anthony
    Giancola, Silvio
    Ghanem, Bernard
    Van Droogenbroeck, Marc
    COMPUTER VISION - ECCV 2024, PT XXIII, 2025, 15081 : 348 - 366
  • [39] SENTIMENT-AWARE AUTOMATIC SPEECH RECOGNITION PRE-TRAINING FOR ENHANCED SPEECH EMOTION RECOGNITION
    Ghriss, Ayoub
    Yang, Bo
    Rozgic, Viktor
    Shriberg, Elizabeth
    Wang, Chao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7347 - 7351
  • [40] POSPAN: Position-Constrained Span Masking for Language Model Pre-training
    Zhang, Zhenyu
    Shen, Lei
    Zhao, Yuming
    Chen, Meng
    He, Xiaodong
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 4420 - 4424