Self-Supervised Speech Representation Learning: A Review

Cited by: 56
Authors:
Mohamed, Abdelrahman [1 ]
Lee, Hung-yi [2 ,3 ]
Borgholt, Lasse [4 ,5 ]
Havtorn, Jakob D. [6 ,7 ]
Edin, Joakim [6 ]
Igel, Christian [5 ]
Kirchhoff, Katrin [8 ]
Li, Shang-Wen [1 ]
Livescu, Karen [9 ]
Maaloe, Lars [6 ,7 ]
Sainath, Tara N. [10 ]
Watanabe, Shinji [11 ]
Affiliations:
[1] Meta, Menlo Pk, CA 94025 USA
[2] Natl Taiwan Univ, Dept Elect Engn, Taipei 10617, Taiwan
[3] Natl Taiwan Univ, Dept Comp Sci Informat Engn, Taipei 10617, Taiwan
[4] Univ Copenhagen, Corti AI, DK-1165 Copenhagen, Denmark
[5] Univ Copenhagen, Dept Comp Sci, DK-1165 Copenhagen, Denmark
[6] Tech Univ Denmark, Corti AI, DK-2800 Lyngby, Denmark
[7] Tech Univ Denmark, Dept Appl Math & Comp Sci, DK-2800 Lyngby, Denmark
[8] Amazon, AWS AI Labs, Seattle, WA 98121 USA
[9] Toyota Technol Inst, Chicago, IL 60615 USA
[10] Google Inc, New York, NY USA
[11] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Keywords:
Task analysis; Hidden Markov models; Data models; Representation learning; Training; Speech processing; Self-supervised learning; speech representations; deep neural networks; word embeddings; spoken language; data augmentation; model; recognition; framework; autoencoders; attention; experts
DOI:
10.1109/JSTSP.2022.3207050
CLC classification:
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject classification:
0808; 0809
Abstract:
Although supervised deep learning has revolutionized speech and audio processing, it has necessitated building specialized models for individual tasks and application scenarios. Such models are likewise difficult to build for dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation learning is still a nascent research area, it is closely related to acoustic word embeddings and learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we review recent efforts on benchmarking learned representations to extend their application beyond speech recognition.
Pages: 1179-1210
Number of pages: 32
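
As a concrete illustration of the contrastive category named in the abstract, the sketch below implements a minimal InfoNCE-style objective of the kind used by contrastive speech models such as wav2vec 2.0. It is not code from the reviewed paper: the function name info_nce_loss, the tensor shapes, and the temperature value are assumptions chosen for the example.

# Minimal sketch of an InfoNCE-style contrastive objective (illustrative only;
# shapes and hyperparameters are assumptions, not the reviewed paper's code).
# The model's output at each masked position must score its true latent higher
# than K sampled distractors, which turns pre-training into a (1+K)-way
# classification problem with no labels required.
import torch
import torch.nn.functional as F

def info_nce_loss(context, targets, negatives, temperature=0.1):
    """Contrastive loss over masked time steps.

    context:   (T, D) model outputs at masked positions
    targets:   (T, D) true latent vectors for those positions
    negatives: (T, K, D) K distractor latents sampled per position
    """
    # Stack the positive with the K negatives: (T, 1+K, D)
    candidates = torch.cat([targets.unsqueeze(1), negatives], dim=1)
    # Cosine similarity of each context vector with its candidates: (T, 1+K)
    logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)
    logits = logits / temperature
    # The positive always sits at index 0
    labels = torch.zeros(context.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy usage: 8 masked positions, 16-dim latents, 10 negatives each
ctx = torch.randn(8, 16)
pos = torch.randn(8, 16)
neg = torch.randn(8, 10, 16)
print(info_nce_loss(ctx, pos, neg))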