DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

Cited by: 0
Authors
Liu, Alexander H. [1 ]
Chang, Heng-Jui [1 ]
Auli, Michael [2 ]
Hsu, Wei-Ning [2 ]
Glass, James [1 ]
Affiliations
[1] MIT, CSAIL, Cambridge, MA 02139 USA
[2] Meta AI, New York, NY USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units. Code available at https://github.com/Alexander-H-Liu/dinosr.
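The pipeline the abstract describes (teacher embeddings, online clustering into a discrete inventory, student prediction of the discrete targets, EMA teacher update) can be sketched in miniature. The toy below is a hedged illustration only, not the authors' implementation (see the linked repository for that): the tiny elementwise "networks", dimensions, decay rates, and update rules are all assumptions, and the masking step of masked language modeling is omitted for brevity.

```python
# Toy sketch of the training loop described in the abstract; NOT the official
# DinoSR implementation. Dimensions, decay rates, and the linear "networks"
# below are illustrative assumptions only.
import random

random.seed(0)
D, K, N = 4, 3, 16          # embedding dim, codebook size, frame count (assumed)
TAU = 0.999                 # teacher EMA decay (assumed)

def rand_vec(d):
    return [random.gauss(0.0, 1.0) for _ in range(d)]

codebook = [rand_vec(D) for _ in range(K)]   # machine-discovered "phone inventory"
teacher = rand_vec(D)                        # toy teacher: elementwise scaling weights
student = list(teacher)                      # student starts from the same weights

def embed(weights, frame):
    # Stand-in for a network's contextualized embedding: elementwise product.
    return [w * f for w, f in zip(weights, frame)]

def nearest_code(z):
    # Online-clustering assignment: index of the closest codebook entry.
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
    return dists.index(min(dists))

frames = [rand_vec(D) for _ in range(N)]

# 1) Teacher embeds the audio frames and discretizes them via the codebook.
targets = [nearest_code(embed(teacher, f)) for f in frames]

# 2) Move each assigned codebook entry toward its frames (EMA-style clustering).
for k in range(K):
    assigned = [embed(teacher, f) for f, t in zip(frames, targets) if t == k]
    if assigned:
        mean = [sum(col) / len(assigned) for col in zip(*assigned)]
        codebook[k] = [0.9 * c + 0.1 * m for c, m in zip(codebook[k], mean)]

# 3) Nudge the student toward the teacher's discrete targets (a crude stand-in
#    for the masked-prediction loss on the discretized tokens).
for f, t in zip(frames, targets):
    z = embed(student, f)
    for i in range(D):
        student[i] -= 0.01 * (z[i] - codebook[t][i]) * f[i]

# 4) Teacher follows the student by exponential moving average (self-distillation).
teacher = [TAU * w_t + (1 - TAU) * w_s for w_t, w_s in zip(teacher, student)]
```

The key design point the abstract highlights is that the clustering runs online, on the teacher's own embeddings, so the discrete targets co-evolve with the representations rather than being fixed in advance.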
Pages: 17
Related Papers (50 total)
  • [21] Probabilistic online self-distillation. Tzelepi, Maria; Passalis, Nikolaos; Tefas, Anastasios. Neurocomputing, 2022, 493: 592-604.
  • [22] Self-Distilled Self-supervised Representation Learning. Jang, Jiho; Kim, Seonhoon; Yoo, Kiyoon; Kong, Chaerin; Kim, Jangho; Kwak, Nojun. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023: 2828-2838.
  • [23] Self-Supervised Representation Learning With Path Integral Clustering for Speaker Diarization. Singh, Prachi; Ganapathy, Sriram. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1639-1649.
  • [24] Efficiency-oriented approaches for self-supervised speech representation learning. Lugo, Luis; Vielzeuf, Valentin. International Journal of Speech Technology, 2024, 27(3): 765-779.
  • [25] Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation. Luo, Jian; Wang, Jianzong; Cheng, Ning; Xiao, Jing. INTERSPEECH 2021: 1169-1173.
  • [26] On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning. Parcollet, Titouan; Zhang, Shucong; Ramos, Alberto Gil C. P.; van Dalen, Rogier; Bhattacharya, Sourav. INTERSPEECH 2023: 581-585.
  • [27] Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction. Mu, Zhaoxi; Yang, Xinyu; Sun, Sining; Yang, Qing. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38, No. 17, 2024: 18815-18823.
  • [28] Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation. Masuyama, Yoshiki; Chang, Xuankai; Zhang, Wangyou; Cornell, Samuele; Wang, Zhong-Qiu; Ono, Nobutaka; Qian, Yanmin; Watanabe, Shinji. 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023.
  • [29] Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning. Kim, Eesung; Jeon, Jae-Jin; Seo, Hyeji; Kim, Hoon. INTERSPEECH 2022: 1411-1415.
  • [30] Few-shot Learning with Online Self-Distillation. Liu, Sihan; Wang, Yue. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2021), 2021: 1067-1070.