SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

Cited by: 0

Authors:

Zhuang, Xuyi [1]

Qian, Yukun [1]

Wang, Mingjiang [1]

Affiliations:

[1] Harbin Inst Technol, Lab Key Technol IoT Terminals, Shenzhen, Peoples R China

Source:

EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING | 2024 / Vol. 2024 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

Self-supervised learning; MAE; Vector quantization; Constrained computational resources; REPRESENTATION;

DOI:

10.1186/s13636-024-00375-1

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Self-supervised learning for speech pre-training models has achieved remarkable success in acquiring superior speech contextual representations by learning from unlabeled audio, excelling in numerous downstream speech tasks. However, the pre-training of these models necessitates significant computational resources and training duration, presenting a high barrier to entry into the realm of pre-training learning. In our efforts, by amalgamating the resource-efficient benefits of the generative learning model, Masked Auto Encoder, with the efficacy of the vector quantization method in discriminative learning, we introduce a novel pre-training framework: Speech Vector Quantization Masked Auto Encoder (SVQ-MAE). Distinct from the majority of SSL frameworks, which require simultaneous construction of speech contextual representations and mask reconstruction within an encoder-only module, we have exclusively designed a decoupled decoder for pre-training SVQ-MAE. This allows the additional decoupled decoder to undertake the mask reconstruction task solely, reducing the learning complexity of pretext tasks and enhancing the encoder's efficiency in extracting speech contextual representations. Owing to this innovation, by using only 4 GPUs, SVQ-MAE can achieve high performance comparable to wav2vec 2.0, which requires 64 GPUs for training. In the Speech Processing Universal Performance Benchmark, SVQ-MAE surpasses wav2vec 2.0 in both keyword spotting and emotion recognition tasks. Furthermore, in cross-lingual ASR for Mandarin, upon fine-tuning on AISHELL-1, SVQ-MAE achieves a Character Error Rate of 4.09%, outperforming all supervised ASR models.


Pages: 16
