Linear-Complexity Self-Supervised Learning for Speech Processing

Cited by: 0
Authors:
Zhang, Shucong [1 ]
Parcollet, Titouan [1 ]
van Dalen, Rogier [1 ]
Bhattacharya, Sourav [1 ]
Affiliations:
[1] Samsung AI Ctr Cambridge, Cambridge, England
Source:
INTERSPEECH 2024
Keywords:
self-supervised learning; efficient models
DOI:
10.21437/Interspeech.2024-500
Abstract:
Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not yet been explored for SSL. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of the wav2vec 2.0 model by 18% and 23%, respectively, allowing pre-training of a 155M-parameter wav2vec 2.0 model to finish within one week on 4 Tesla A100 GPUs. Code(1) is available.
Pages: 3480-3484
Page count: 5
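The abstract describes replacing the quadratic-cost MHSA context encoder of wav2vec 2.0 with the linear-complexity SummaryMixing block. Below is a minimal, illustrative PyTorch sketch of that idea, assuming the formulation from the SummaryMixing paper (entry [4] in the related papers below): each frame goes through a local transformation and is then combined with a single utterance-level summary vector (a mean over transformed frames), so cost grows linearly with the number of frames. All class and parameter names here are hypothetical; the authors' released code is the authoritative implementation.

```python
# Hypothetical sketch of a SummaryMixing-style block (illustrative names,
# not the authors' API).
import torch
import torch.nn as nn


class SummaryMixingBlock(nn.Module):
    """Mixes each frame with one global summary vector: O(T) time and memory,
    unlike the O(T^2) pairwise attention matrix of MHSA."""

    def __init__(self, d_model: int, d_summary: int | None = None):
        super().__init__()
        d_summary = d_summary or d_model
        # Per-frame (local) transformation.
        self.local = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        # Transformation applied before averaging into the summary vector.
        self.summary = nn.Sequential(nn.Linear(d_model, d_summary), nn.GELU())
        # Combines each local output with the shared utterance-level summary.
        self.combine = nn.Sequential(
            nn.Linear(d_model + d_summary, d_model), nn.GELU()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                      # O(T): one pass per frame
        summary = self.summary(x).mean(dim=1)      # O(T): single mean over time
        summary = summary.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.combine(torch.cat([local, summary], dim=-1))


if __name__ == "__main__":
    block = SummaryMixingBlock(d_model=512)
    frames = torch.randn(2, 200, 512)              # (batch, frames, features)
    print(block(frames).shape)                     # torch.Size([2, 200, 512])
```

Because the summary is a single vector per utterance, doubling the input length roughly doubles the cost of this block, whereas an MHSA attention matrix would grow fourfold; this is the property behind the pre-training time and peak-VRAM reductions reported in the abstract.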
Related Papers (50 in total)
  • [1] Toward a realistic model of speech processing in the brain with self-supervised learning
    Millet, Juliette
    Caucheteux, Charlotte
    Orhan, Pierre
    Boubenec, Yves
    Gramfort, Alexandre
    Dunbar, Ewan
    Pallier, Christophe
    King, Jean-Remi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [2] Self-Supervised Speech Representation Learning: A Review
    Mohamed, Abdelrahman
    Lee, Hung-yi
    Borgholt, Lasse
    Havtorn, Jakob D.
    Edin, Joakim
    Igel, Christian
    Kirchhoff, Katrin
    Li, Shang-Wen
    Livescu, Karen
    Maaloe, Lars
    Sainath, Tara N.
    Watanabe, Shinji
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1179 - 1210
  • [3] Editorial of Special Issue on Self-Supervised Learning for Speech and Audio Processing
    Lee, Hung-Yi
    Watanabe, Shinji
    Livescu, Karen
    Mohamed, Abdelrahman
    Sainath, Tara
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1174 - 1178
  • [4] SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
    Parcollet, Titouan
    van Dalen, Rogier
    Zhang, Shucong
    Bhattacharya, Sourav
    INTERSPEECH 2024, 2024, : 3460 - 3464
  • [5] Characterizing the Adversarial Vulnerability of Speech Self-Supervised Learning
    Wu, Haibin
    Zheng, Bo
    Li, Xu
    Wu, Xixin
    Lee, Hung-Yi
    Meng, Helen
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 3164 - 3168
  • [6] Investigating Self-Supervised Learning for Speech Enhancement and Separation
    Huang, Zili
    Watanabe, Shinji
    Yang, Shu-wen
    Garcia, Paola
    Khudanpur, Sanjeev
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6837 - 6841
  • [7] Self-Supervised Learning With Segmental Masking for Speech Representation
    Yue, Xianghu
    Lin, Jingru
    Gutierrez, Fabian Ritter
    Li, Haizhou
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1367 - 1379
  • [8] Phonetically Motivated Self-Supervised Speech Representation Learning
    Yue, Xianghu
    Li, Haizhou
    INTERSPEECH 2021, 2021, : 746 - 750
  • [9] TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
    Liu, Andy T.
    Li, Shang-Wen
    Lee, Hung-yi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2351 - 2366