Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Cited by: 0
Authors
Niizumi, Daisuke [1]
Takeuchi, Daiki [1]
Ohishi, Yasunori [1]
Harada, Noboru [1]
Kashino, Kunio [1]
Affiliation
[1] NTT Corp, Tokyo, Japan
Source
INTERSPEECH 2023
Keywords
speech representation learning; general-purpose audio representation; denoising; distillation; specialization
DOI
10.21437/Interspeech.2023-221
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
General-purpose audio representations learned through self-supervised learning have demonstrated high performance in a variety of tasks. Although they can be optimized for an application by fine-tuning, even higher performance can be expected if pre-training itself can be specialized for that application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, which learns from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field.
Pages: 1294-1298
Number of pages: 5