SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

Times Cited: 0
Authors
Zhuang, Xuyi [1 ]
Qian, Yukun [1 ]
Wang, Mingjiang [1 ]
Affiliations
[1] Harbin Inst Technol, Lab Key Technol IoT Terminals, Shenzhen, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
Self-supervised learning; MAE; Vector quantization; Constrained computational resources; REPRESENTATION;
DOI
10.1186/s13636-024-00375-1
CLC Number
O42 [Acoustics];
Discipline Codes
070206 ; 082403 ;
Abstract
Self-supervised learning for speech pre-training models has achieved remarkable success in acquiring superior speech contextual representations by learning from unlabeled audio, excelling in numerous downstream speech tasks. However, pre-training these models demands substantial computational resources and training time, creating a high barrier to entry for pre-training research. By combining the resource efficiency of the generative Masked Autoencoder with the effectiveness of vector quantization from discriminative learning, we introduce a novel pre-training framework: Speech Vector Quantization Masked Autoencoder (SVQ-MAE). Unlike most SSL frameworks, which construct speech contextual representations and perform mask reconstruction simultaneously within an encoder-only module, SVQ-MAE is pre-trained with a dedicated decoupled decoder. The decoupled decoder handles the mask reconstruction task alone, reducing the learning complexity of the pretext task and allowing the encoder to focus on extracting speech contextual representations. Owing to this design, SVQ-MAE trained on only 4 GPUs achieves performance comparable to wav2vec 2.0, which requires 64 GPUs. On the Speech processing Universal PERformance Benchmark (SUPERB), SVQ-MAE surpasses wav2vec 2.0 in both keyword spotting and emotion recognition. Furthermore, in cross-lingual ASR for Mandarin, after fine-tuning on AISHELL-1, SVQ-MAE achieves a character error rate of 4.09%, outperforming all supervised ASR models.
Pages: 16
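
As a rough illustration of the approach described in the abstract, the sketch below shows a decoupled MAE-style setup for speech: an encoder builds contextual representations from partially masked features, while a small separate decoder alone reconstructs the masked frames against vector-quantized targets. All module names, layer sizes, the mask ratio, and the use of an MSE reconstruction loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of an SVQ-MAE-style model:
# the encoder sees speech features with masked frames replaced by a learned
# embedding; a small decoupled decoder alone reconstructs the masked frames
# against vector-quantized targets computed from the clean features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup producing quantized reconstruction targets."""

    def __init__(self, num_codes: int = 320, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, D)
        codes = self.codebook.weight                       # (N, D)
        dists = torch.cdist(x, codes.unsqueeze(0).expand(x.size(0), -1, -1))
        return self.codebook(dists.argmin(dim=-1))         # (B, T, D) quantized targets


class SVQMAESketch(nn.Module):
    def __init__(self, dim: int = 256, enc_layers: int = 6,
                 dec_layers: int = 2, mask_ratio: float = 0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_emb = nn.Parameter(torch.zeros(dim))     # learned mask embedding
        enc_block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        dec_block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_block, enc_layers)  # kept for fine-tuning
        self.decoder = nn.TransformerEncoder(dec_block, dec_layers)  # used only in pre-training
        self.quantizer = VectorQuantizer(dim=dim)
        self.head = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor):                # feats: (B, T, D) speech features
        mask = torch.rand(feats.shape[:2], device=feats.device) < self.mask_ratio
        targets = self.quantizer(feats)                    # VQ targets from clean features
        x = feats.clone()
        x[mask] = self.mask_emb                            # corrupt the masked positions
        context = self.encoder(x)                          # contextual representations
        recon = self.head(self.decoder(context))           # decoupled decoder reconstructs
        # Loss only on masked frames; targets detached (no codebook/commitment loss here).
        loss = F.mse_loss(recon[mask], targets[mask].detach())
        return loss, context


if __name__ == "__main__":
    model = SVQMAESketch()
    dummy = torch.randn(2, 100, 256)                       # 2 utterances, 100 frames each
    loss, _ = model(dummy)
    print(f"pre-training loss: {loss.item():.4f}")
```

After pre-training, the decoder and quantizer would typically be discarded and only the encoder fine-tuned for downstream tasks such as ASR, which is what makes the decoupled design resource-efficient.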