Direct enhancement of pre-trained speech embeddings for speech processing in noisy conditions

Cited by: 1
Authors:
Ali, Mohamed Nabih [1 ,2 ]
Brutti, Alessio [2 ]
Falavigna, Daniele
Affiliations:
[1] Univ Trento, IECS Doctoral Sch, Trento, Italy
[2] Fdn Bruno Kessler, Digital Soc Ctr, Trento, Italy
Source:
Computer Speech and Language
Keywords:
Speech enhancement; Automatic speech recognition; Speech embedding; Speech classification; NETWORK
DOI:
10.1016/j.csl.2023.101501
CLC number:
TP18 [Artificial intelligence theory]
Discipline codes:
081104; 0812; 0835; 1405
Abstract:
Lately, the development of deep learning algorithms has marked milestones in the field of speech processing. In particular, the release of pre-trained feature extraction models has considerably simplified the development of speech classification and recognition algorithms. However, environmental noise and reverberation still degrade overall performance, making robustness in noisy conditions mandatory for real-world applications. One way to mitigate the noise effect is to integrate a speech enhancement front-end that removes artifacts from the desired speech signals. Unlike state-of-the-art enhancement approaches, which operate either on speech spectrograms or directly on time-domain signals, in this paper we study how enhancement can be applied directly to the speech embeddings extracted with Wav2Vec and WavLM models. Moreover, we investigate a variety of training approaches, considering different flavors of joint and disjoint training of the speech enhancement front-end with the classification/recognition back-end. We perform exhaustive experiments on the Fluent Speech Commands and Google Speech Commands datasets contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, as well as on the LibriSpeech dataset contaminated with noises from the MUSAN dataset, considering intent classification, keyword spotting, and speech recognition tasks, respectively. Results show that directly enhancing the speech embeddings is a viable, computationally efficient approach, and provide insights into the most promising training strategies.
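For illustration, the following is a minimal sketch (not the authors' implementation) of enhancement applied at the embedding level: a frozen Wav2Vec 2.0 base encoder from the HuggingFace transformers library extracts 768-dimensional embeddings, a small residual MLP (an assumed architecture, not the one described in the paper) maps noisy embeddings toward their clean counterparts, and a mean-pooling classifier consumes the enhanced embeddings. The combined MSE-plus-cross-entropy loss mimics a joint-training flavor; dropping the classification term would correspond to disjoint training of the front-end.

```python
# Minimal sketch of embedding-level enhancement (illustrative, not the paper's code).
# Assumptions: frozen Wav2Vec 2.0 base encoder (768-dim), residual MLP enhancer,
# mean-pooling classifier with 31 classes (the number of Fluent Speech Commands intents).
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model


class EmbeddingEnhancer(nn.Module):
    """Residual MLP mapping noisy embeddings toward clean ones (assumed design)."""
    def __init__(self, dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # predict a correction on top of the noisy embedding


class IntentClassifier(nn.Module):
    """Mean-pool over time, then a linear layer over intent classes."""
    def __init__(self, dim: int = 768, num_classes: int = 31):
        super().__init__()
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x.mean(dim=1))


extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()  # frozen

enhancer, classifier = EmbeddingEnhancer(), IntentClassifier()

# Dummy paired noisy/clean audio (1 s at 16 kHz) standing in for a real batch.
noisy_wav = torch.randn(16000).numpy()
clean_wav = torch.randn(16000).numpy()

with torch.no_grad():  # the pre-trained encoder is not fine-tuned
    noisy_emb = encoder(**extractor(noisy_wav, sampling_rate=16000,
                                    return_tensors="pt")).last_hidden_state
    clean_emb = encoder(**extractor(clean_wav, sampling_rate=16000,
                                    return_tensors="pt")).last_hidden_state

enhanced = enhancer(noisy_emb)
# Joint-style objective: embedding regression plus the back-end task loss.
loss = nn.functional.mse_loss(enhanced, clean_emb) \
     + nn.functional.cross_entropy(classifier(enhanced), torch.tensor([3]))
loss.backward()  # gradients flow into the enhancer and classifier only
```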
Pages: 12
Related papers (50 records in total; 10 listed):
  • [1] Pre-trained Text Embeddings for Enhanced Text-to-Speech Synthesis
    Hayashi, Tomoki
    Watanabe, Shinji
    Toda, Tomoki
    Takeda, Kazuya
    Toshniwal, Shubham
    Livescu, Karen
    [J]. INTERSPEECH 2019, 2019, : 4430 - 4434
  • [2] Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings
    Girish, K. V. Vijay
    Konjeti, Srikanth
    Vepa, Jithendra
    [J]. INTERSPEECH 2022, 2022, : 4496 - 4500
  • [3] Processing Noisy Speech for Enhancement
    Krishnamoorthy, P.
    Prasanna, Mahadeva
    [J]. IETE TECHNICAL REVIEW, 2007, 24 (05) : 351 - 357
  • [4] Enhancing Embeddings for Speech Classification in Noisy Conditions
    Ali, Mohamed Nabih
    Falavigna, Daniele
    Brutti, Alessio
    [J]. INTERSPEECH 2022, 2022, : 2933 - 2937
  • [5] Alzheimer Disease Recognition Using Speech-Based Embeddings From Pre-Trained Models
    Gauder, Lara
    Pepino, Leonardo
    Ferrer, Luciana
    Riera, Pablo
    [J]. INTERSPEECH 2021, 2021, : 3795 - 3799
  • [6] Speaker Anonymization: Disentangling Speaker Features from Pre-Trained Speech Embeddings for Voice Conversion
    Matassoni, Marco
    Fong, Seraphina
    Brutti, Alessio
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (09):
  • [7] Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization
    Zhao, Xiao-Ying
    Zhu, Qiu-Shi
    Zhang, Jie
    [J]. PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 330 - 334
  • [8] Learning time-frequency mask for noisy speech enhancement using gaussian-bernoulli pre-trained deep neural networks
    Saleem, Nasir
    Khattak, Muhammad Irfan
    Al-Hasan, Mu'ath
    Jan, Atif
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2021, 40 (01) : 849 - 864
  • [9] Enhancement of noisy speech by temporal and spectral processing
    Krishnamoorthy, P.
    Prasanna, S. R. M.
    [J]. SPEECH COMMUNICATION, 2011, 53 (02) : 154 - 174
  • [10] Debiasing Pre-trained Contextualised Embeddings
    Kaneko, Masahiro
    Bollegala, Danushka
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1256 - 1266