Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance

Cited by: 0
Authors
Perezhohin, Yuriy [1 ,2 ]
Santos, Tiago [1 ,2 ]
Costa, Victor [1 ,2 ]
Peres, Fernando [1 ]
Castelli, Mauro [2 ]
Affiliations
[1] MyNorth AI Res, P-2780125 Oeiras, Portugal
[2] Univ NOVA Lisboa, NOVA Informat Management Sch NOVA IMS, Campus Campolide, P-1070312 Lisbon, Portugal
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Hidden Markov models; Feature extraction; Filtering; Data models; Synthetic data; Training; Contrastive learning; Accuracy; Adaptation models; Transformers; Automatic speech recognition; Text to speech; contrastive learning; data augmentation; embeddings; synthetic data filtering; text-to-speech; REPRESENTATIONS;
DOI
10.1109/ACCESS.2024.3482970
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address the challenge of incorporating synthetic data into ASR training, especially in scenarios with limited real-world data or unique linguistic characteristics. The method utilizes a contrastive learning model to align representations of synthetic audio and its corresponding text transcripts, enabling the identification and removal of low-quality samples that do not align well semantically. We evaluate the methodology on a medium-resource language across two distinct datasets: a general-domain dataset and a regionally specific dataset characterized by unique pronunciation patterns. Experimental results reveal that the optimal filtering strategy depends on both model capacity and dataset characteristics. Larger models, like Whisper Large V3, particularly benefit from aggressive filtering, while smaller models may not require such stringent filtering, especially on non-normalized text. This work highlights the importance of adjusting synthetic data augmentation and filtering to specific model architectures and target domains. The proposed method, robust and adaptable, enhances ASR performance across diverse language settings. We have open-sourced the entire work, which includes 140 hours of synthetically generated Portuguese speech, as well as the pipeline and parameter settings used to create these samples. Additionally, we provide the fine-tuned Whisper models and the code required to reproduce this research. Our code will be available at https://github.com/my-north-ai/semantic_audio_filtering.
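The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how alignment is scored, so the cosine-similarity score, the fixed threshold, the field names (`audio_emb`, `text_emb`) and the toy two-dimensional embeddings are all assumptions made here for illustration; they are not taken from the paper or its repository. In practice the embeddings would come from the trained contrastive model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_synthetic_samples(samples, threshold):
    """Return the ids of samples whose audio and text embeddings align.

    Assumption: alignment is scored by cosine similarity and a sample is
    kept when its score is at least `threshold`. A higher threshold means
    more aggressive filtering.
    """
    kept = []
    for sample in samples:
        score = cosine_similarity(sample["audio_emb"], sample["text_emb"])
        if score >= threshold:
            kept.append(sample["id"])
    return kept


# Hypothetical toy embeddings standing in for contrastive-model outputs.
samples = [
    {"id": "utt-1", "audio_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]},  # well aligned
    {"id": "utt-2", "audio_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},  # orthogonal
    {"id": "utt-3", "audio_emb": [1.0, 1.0], "text_emb": [1.0, 0.0]},  # partly aligned
]

kept_ids = filter_synthetic_samples(samples, threshold=0.5)
print(kept_ids)
```

Raising `threshold` corresponds to the more aggressive filtering that the abstract reports as beneficial for larger models; the right value would have to be tuned per model and dataset.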
Pages: 155136-155150
Page count: 15