Self-Supervised Speech Representation Learning: A Review

Cited by: 56
Authors
Mohamed, Abdelrahman [1]
Lee, Hung-yi [2,3]
Borgholt, Lasse [4,5]
Havtorn, Jakob D. [6,7]
Edin, Joakim [6]
Igel, Christian [5]
Kirchhoff, Katrin [8]
Li, Shang-Wen [1]
Livescu, Karen [9]
Maaløe, Lars [6,7]
Sainath, Tara N. [10]
Watanabe, Shinji [11]
Affiliations
[1] Meta, Menlo Pk, CA 94025 USA
[2] Natl Taiwan Univ, Dept Elect Engn, Taipei 10617, Taiwan
[3] Natl Taiwan Univ, Dept Comp Sci Informat Engn, Taipei 10617, Taiwan
[4] Univ Copenhagen, Corti AI, DK-1165 Copenhagen, Denmark
[5] Univ Copenhagen, Dept Comp Sci, DK-1165 Copenhagen, Denmark
[6] Tech Univ Denmark, Corti AI, DK-2800 Lyngby, Denmark
[7] Tech Univ Denmark, Dept Appl Math & Comp Sci, DK-2800 Lyngby, Denmark
[8] Amazon, AWS AI Labs, Seattle, WA 98121 USA
[9] Toyota Technol Inst, Chicago, IL 60615 USA
[10] Google Inc, New York, NY USA
[11] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Keywords
Task analysis; Hidden Markov models; Data models; Representation learning; Training; Speech processing; Self-supervised learning; speech representations; DEEP NEURAL-NETWORKS; WORD EMBEDDINGS; SPOKEN LANGUAGE; DATA AUGMENTATION; MODEL; RECOGNITION; FRAMEWORK; AUTOENCODERS; ATTENTION; EXPERTS
DOI
10.1109/JSTSP.2022.3207050
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Discipline Classification Code
0808; 0809
Abstract
Although supervised deep learning has revolutionized speech and audio processing, it has necessitated building specialist models for individual tasks and application scenarios. Supervised learning is likewise difficult to apply to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation learning is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we also review recent efforts on benchmarking learned representations to extend their application beyond speech recognition.
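Illustrative code sketch
The contrastive category mentioned in the abstract can be made concrete with a small example. Below is a minimal PyTorch sketch of an InfoNCE-style contrastive objective of the kind popularized by wav2vec 2.0: encoder outputs at masked positions are trained to match their true (e.g., quantized) targets while rejecting sampled distractor frames. All tensor shapes, the distractor-sampling interface, and the temperature value are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, targets, distractor_ids, temperature=0.1):
        # context:        (B, T, D) encoder outputs at masked positions (assumed shapes)
        # targets:        (B, T, D) true targets for those positions
        # distractor_ids: (B, T, K) per-position indices of K sampled negative frames
        B, T, D = context.shape
        K = distractor_ids.shape[-1]
        # Gather the K negative targets for every masked position: (B, T, K, D)
        idx = distractor_ids.reshape(B, T * K, 1).expand(-1, -1, D)
        negatives = torch.gather(targets, 1, idx).reshape(B, T, K, D)
        # Candidate set = true target plus K distractors: (B, T, K+1, D)
        candidates = torch.cat([targets.unsqueeze(2), negatives], dim=2)
        # Scaled cosine similarity between context and each candidate: (B, T, K+1)
        logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature
        # The true target is always candidate 0, so InfoNCE reduces to cross-entropy
        labels = torch.zeros(B * T, dtype=torch.long, device=context.device)
        return F.cross_entropy(logits.reshape(B * T, K + 1), labels)

Generative and predictive variants keep the same masking setup but change the target: reconstructing input features in the generative case, or matching offline-clustered pseudo-labels (as in HuBERT) in the predictive case.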
Pages: 1179-1210
Page count: 32
Related Papers
50 records in total
  • [1] Self-Supervised Learning With Segmental Masking for Speech Representation
    Yue, Xianghu
    Lin, Jingru
    Gutierrez, Fabian Ritter
    Li, Haizhou
    [J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16 (06): 1367-1379
  • [2] Phonetically Motivated Self-Supervised Speech Representation Learning
    Yue, Xianghu
    Li, Haizhou
    [J]. INTERSPEECH 2021, 2021: 746-750
  • [3] TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
    Liu, Andy T.
    Li, Shang-Wen
    Lee, Hung-yi
    [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366
  • [4] Clustering and Retraining Based Self-Supervised Speech Representation Learning Method
    Zhang, Wenlin
    Liu, Xuepeng
    Niu, Tong
    Yang, Xukui
    Qu, Dan
    [J]. Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2022, 35 (05): 461-471
  • [5] Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation
    Luo, Jian
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    [J]. INTERSPEECH 2021, 2021: 1169-1173
  • [6] On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning
    Parcollet, Titouan
    Zhang, Shucong
    Ramos, Alberto Gil C. P.
    van Dalen, Rogier
    Bhattacharya, Sourav
    [J]. INTERSPEECH 2023, 2023: 581-585
  • [7] Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction
    Mu, Zhaoxi
    Yang, Xinyu
    Sun, Sining
    Yang, Qing
    [J]. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol 38 No 17, 2024: 18815-18823
  • [8] Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation
    Masuyama, Yoshiki
    Chang, Xuankai
    Zhang, Wangyou
    Cornell, Samuele
    Wang, Zhong-Qiu
    Ono, Nobutaka
    Qian, Yanmin
    Watanabe, Shinji
    [J]. 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023
  • [9] Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning
    Kim, Eesung
    Jeon, Jae-Jin
    Seo, Hyeji
    Kim, Hoon
    [J]. INTERSPEECH 2022, 2022: 1411-1415