Linear-Complexity Self-Supervised Learning for Speech Processing

Cited by: 0
Authors:
Zhang, Shucong [1 ]
Parcollet, Titouan [1 ]
van Dalen, Rogier [1 ]
Bhattacharya, Sourav [1 ]
Affiliations:
[1] Samsung AI Ctr Cambridge, Cambridge, England
Source:
INTERSPEECH 2024
Keywords:
self-supervised learning; efficient models
DOI:
10.21437/Interspeech.2024-500
Abstract:
Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not yet been explored for SSL. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of the wav2vec 2.0 model by 18% and 23%, respectively, allowing pre-training of a 155M-parameter wav2vec 2.0 model to finish within one week on 4 Tesla A100 GPUs. Code(1) is available.
Pages: 3480-3484
Page count: 5
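The abstract describes replacing the quadratic-cost MHSA context encoder of wav2vec 2.0 with the linear-complexity SummaryMixing block. Below is a minimal, illustrative PyTorch sketch of that idea, assuming the formulation from the SummaryMixing paper (entry [4] in the related papers below): each frame goes through a local transformation and is then combined with a single utterance-level summary vector (a mean over transformed frames), so cost grows linearly with the number of frames. All class and parameter names here are hypothetical; the authors' released code is the authoritative implementation.

```python
# Hypothetical sketch of a SummaryMixing-style block (illustrative names,
# not the authors' API).
import torch
import torch.nn as nn


class SummaryMixingBlock(nn.Module):
    """Mixes each frame with one global summary vector: O(T) time and memory,
    unlike the O(T^2) pairwise attention matrix of MHSA."""

    def __init__(self, d_model: int, d_summary: int | None = None):
        super().__init__()
        d_summary = d_summary or d_model
        # Per-frame (local) transformation.
        self.local = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        # Transformation applied before averaging into the summary vector.
        self.summary = nn.Sequential(nn.Linear(d_model, d_summary), nn.GELU())
        # Combines each local output with the shared utterance-level summary.
        self.combine = nn.Sequential(
            nn.Linear(d_model + d_summary, d_model), nn.GELU()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                      # O(T): one pass per frame
        summary = self.summary(x).mean(dim=1)      # O(T): single mean over time
        summary = summary.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.combine(torch.cat([local, summary], dim=-1))


if __name__ == "__main__":
    block = SummaryMixingBlock(d_model=512)
    frames = torch.randn(2, 200, 512)              # (batch, frames, features)
    print(block(frames).shape)                     # torch.Size([2, 200, 512])
```

Because the summary is a single vector per utterance, doubling the input length roughly doubles the cost of this block, whereas an MHSA attention matrix would grow fourfold; this is the property behind the pre-training time and peak-VRAM reductions reported in the abstract.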
Related Papers (50 in total)
  • [1] Toward a realistic model of speech processing in the brain with self-supervised learning
    Millet, Juliette
    Caucheteux, Charlotte
    Orhan, Pierre
    Boubenec, Yves
    Gramfort, Alexandre
    Dunbar, Ewan
    Pallier, Christophe
    King, Jean-Remi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [2] Self-Supervised Speech Representation Learning: A Review
    Mohamed, Abdelrahman
    Lee, Hung-yi
    Borgholt, Lasse
    Havtorn, Jakob D.
    Edin, Joakim
    Igel, Christian
    Kirchhoff, Katrin
    Li, Shang-Wen
    Livescu, Karen
    Maaloe, Lars
    Sainath, Tara N.
    Watanabe, Shinji
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1179 - 1210
  • [3] Editorial of Special Issue on Self-Supervised Learning for Speech and Audio Processing
    Lee, Hung-Yi
    Watanabe, Shinji
    Livescu, Karen
    Mohamed, Abdelrahman
    Sainath, Tara
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1174 - 1178
  • [4] SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding
    Parcollet, Titouan
    van Dalen, Rogier
    Zhang, Shucong
    Bhattacharya, Sourav
    INTERSPEECH 2024, 2024, : 3460 - 3464
  • [5] Characterizing the Adversarial Vulnerability of Speech Self-Supervised Learning
    Wu, Haibin
    Zheng, Bo
    Li, Xu
    Wu, Xixin
    Lee, Hung-Yi
    Meng, Helen
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 3164 - 3168
  • [6] Investigating Self-Supervised Learning for Speech Enhancement and Separation
    Huang, Zili
    Watanabe, Shinji
    Yang, Shu-wen
    Garcia, Paola
    Khudanpur, Sanjeev
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6837 - 6841
  • [7] Self-Supervised Learning With Segmental Masking for Speech Representation
    Yue, Xianghu
    Lin, Jingru
    Gutierrez, Fabian Ritter
    Li, Haizhou
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1367 - 1379
  • [8] Phonetically Motivated Self-Supervised Speech Representation Learning
    Yue, Xianghu
    Li, Haizhou
    INTERSPEECH 2021, 2021, : 746 - 750
  • [9] TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
    Liu, Andy T.
    Li, Shang-Wen
    Lee, Hung-yi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2351 - 2366