SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

Cited by: 0
Authors
Zhuang, Xuyi [1]
Qian, Yukun [1]
Wang, Mingjiang [1]
Affiliations
[1] Harbin Inst Technol, Lab Key Technol IoT Terminals, Shenzhen, Peoples R China
Source
EURASIP Journal on Audio, Speech, and Music Processing
Funding
National Natural Science Foundation of China
Keywords
Self-supervised learning; MAE; Vector quantization; Constrained computational resources; Representation
DOI
10.1186/s13636-024-00375-1
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Self-supervised learning (SSL) for speech pre-training has achieved remarkable success: by learning from unlabeled audio, SSL models acquire superior speech contextual representations and excel in numerous downstream speech tasks. However, pre-training these models demands substantial computational resources and training time, raising a high barrier to entry. By combining the resource efficiency of a generative model, the Masked Auto Encoder (MAE), with the effectiveness of vector quantization from discriminative learning, we introduce a novel pre-training framework: the Speech Vector Quantization Masked Auto Encoder (SVQ-MAE). Unlike most SSL frameworks, which must both build speech contextual representations and perform mask reconstruction within a single encoder-only module, SVQ-MAE adds a decoupled decoder used only during pre-training. This decoder alone undertakes the mask-reconstruction pretext task, which lowers the task's learning complexity and lets the encoder concentrate on extracting speech contextual representations. Owing to this design, SVQ-MAE trained on only 4 GPUs achieves performance comparable to wav2vec 2.0, which requires 64 GPUs. On the Speech processing Universal PERformance Benchmark (SUPERB), SVQ-MAE surpasses wav2vec 2.0 in both keyword spotting and emotion recognition. Furthermore, for cross-lingual ASR in Mandarin, fine-tuning on AISHELL-1 yields a character error rate (CER) of 4.09%, outperforming all compared supervised ASR models.
Pages: 16
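
The decoupled-decoder design described in the abstract above can be made concrete with a short sketch. The PyTorch code below is a minimal illustration written for this record, not the authors' implementation: the class name SVQMAESketch, all layer sizes, the mask ratio, and the simplistic nearest-neighbour quantizer (with no codebook training) are assumptions. It shows only the division of labor the abstract describes, in which the encoder attends to visible frames alone while a small, pre-training-only decoder reconstructs vector-quantized targets at the masked positions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SVQMAESketch(nn.Module):
    # Hypothetical sketch of the SVQ-MAE idea: the encoder builds contextual
    # representations from visible frames only, while a decoupled decoder is
    # solely responsible for reconstructing vector-quantized targets at the
    # masked positions. All names and sizes are illustrative assumptions.
    def __init__(self, dim=256, n_codes=320, enc_layers=6, dec_layers=2):
        super().__init__()
        self.frontend = nn.Conv1d(1, dim, kernel_size=10, stride=5)  # crude waveform downsampler
        self.codebook = nn.Embedding(n_codes, dim)                   # VQ codebook supplying discrete targets
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)  # kept for downstream tasks
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, dec_layers)  # decoupled, pre-training only
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, n_codes)

    def forward(self, wav, mask_ratio=0.5):
        x = self.frontend(wav.unsqueeze(1)).transpose(1, 2)          # (B, T, dim) frame features
        B, T, D = x.shape
        # Nearest-neighbour VQ targets; a real system would also train the
        # codebook (e.g. EMA or commitment losses), omitted here for brevity.
        targets = torch.cdist(x.detach(), self.codebook.weight).argmin(-1)  # (B, T)
        n_keep = int(T * (1 - mask_ratio))
        keep = torch.rand(B, T, device=x.device).argsort(1)[:, :n_keep]     # random visible indices
        visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        ctx = self.encoder(visible)                                  # encoder never sees masked frames
        full = self.mask_token.expand(B, T, D).clone()               # mask tokens everywhere ...
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), ctx)  # ... except the visible slots
        logits = self.head(self.decoder(full))                       # decoder does the reconstruction
        is_masked = torch.ones(B, T, dtype=torch.bool, device=x.device)
        is_masked.scatter_(1, keep, torch.zeros_like(keep, dtype=torch.bool))
        return F.cross_entropy(logits[is_masked], targets[is_masked])

# Smoke test on random audio: one pre-training step of the sketch.
model = SVQMAESketch()
loss = model(torch.randn(2, 4000))   # two 0.25-second clips at 16 kHz
loss.backward()

After pre-training, the quantizer and the decoder would presumably be discarded and only the frontend plus encoder fine-tuned, which is what keeps the extra decoder cheap: its capacity is spent on the pretext task alone and never burdens downstream inference.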