Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Authors
Niizumi, Daisuke [1]
Takeuchi, Daiki [1]
Ohishi, Yasunori [1]
Harada, Noboru [1]
Kashino, Kunio [1]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
INTERSPEECH 2023
Keywords
speech representation learning; general-purpose audio representation; denoising; distillation; specialization
DOI
10.21437/Interspeech.2023-221
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
General-purpose audio representations learned through self-supervised learning have demonstrated high performance in a variety of tasks. Although they can be optimized for an application by fine-tuning, even higher performance can be expected if they can be specialized for that application during pre-training. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, which learns from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field.
Pages: 1294-1298
Number of pages: 5