Linear-Complexity Self-Supervised Learning for Speech Processing

Cited by: 0
Authors
Zhang, Shucong [1 ]
Parcollet, Titouan [1 ]
van Dalen, Rogier [1 ]
Bhattacharya, Sourav [1 ]
Affiliations
[1] Samsung AI Center Cambridge, Cambridge, England
Keywords
self-supervised learning; efficient models
DOI
10.21437/Interspeech.2024-500
Abstract
Self-supervised learning (SSL) models usually require weeks of pre-training on dozens of high-end GPUs. These models typically use a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed; for instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not yet been explored for SSL. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of the wav2vec 2.0 model by 18% and 23%, respectively, so that a 155M-parameter wav2vec 2.0 model can be pre-trained within one week on 4 Tesla A100 GPUs. Code is available.
Pages: 3480-3484 (5 pages)
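
The SummaryMixing context encoder referenced in the abstract avoids the quadratic cost of MHSA by replacing pairwise attention with a single utterance-level summary: each frame is passed through a local transform, a second per-frame projection is averaged over time into one summary vector, and the two are combined per frame. The sketch below illustrates that idea only; the hidden width, activations, and the absence of padding masks and normalisation are assumptions for illustration, not the configuration used in the paper.

# Minimal sketch of a SummaryMixing-style layer (linear in sequence length T).
# Layer widths, activation choices, and the lack of masking/normalisation are
# illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    def __init__(self, d_model: int, d_hidden: int = 512):
        super().__init__()
        # Per-frame "local" transform f(x_t).
        self.local = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        # Per-frame "summary" projection s(x_t), later averaged over time.
        self.summary = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        # Combiner c([f(x_t); s_bar]) producing the layer output.
        self.combine = nn.Sequential(nn.Linear(2 * d_hidden, d_model), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                               # (B, T, H)
        s_bar = self.summary(x).mean(dim=1, keepdim=True)   # (B, 1, H): one global summary
        s_bar = s_bar.expand(-1, x.size(1), -1)             # broadcast the summary to every frame
        return self.combine(torch.cat([local, s_bar], dim=-1))  # (B, T, d_model)

x = torch.randn(2, 100, 256)          # 2 utterances, 100 frames, 256-dim features
print(SummaryMixing(256)(x).shape)    # torch.Size([2, 100, 256])

Because every frame interacts only with a single averaged summary vector rather than with every other frame, time and memory grow linearly with the number of frames, which is what makes this kind of context encoder cheaper to pre-train than MHSA.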