Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance

Cited by: 0
Authors
Perezhohin, Yuriy [1 ,2 ]
Santos, Tiago [1 ,2 ]
Costa, Victor [1 ,2 ]
Peres, Fernando [1 ]
Castelli, Mauro [2 ]
Affiliations
[1] MyNorth AI Res, P-2780125 Oeiras, Portugal
[2] Univ NOVA Lisboa, NOVA Informat Management Sch NOVA IMS, Campus Campolide, P-1070312 Lisbon, Portugal
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Hidden Markov models; Feature extraction; Filtering; Data models; Synthetic data; Training; Contrastive learning; Accuracy; Adaptation models; Transformers; Automatic speech recognition; Text to speech; contrastive learning; data augmentation; embeddings; synthetic data filtering; text-to-speech; representations
DOI
10.1109/ACCESS.2024.3482970
CLC Number
TP [Automation and Computer Technology]
Discipline Code
0812
Abstract
This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address the challenge of incorporating synthetic data into ASR training, especially in scenarios with limited real-world data or unique linguistic characteristics. The method utilizes a contrastive learning model to align representations of synthetic audio and its corresponding text transcripts, enabling the identification and removal of low-quality samples that do not align well semantically. We evaluate the methodology on a medium-resource language across two distinct datasets: a general-domain dataset and a regionally specific dataset characterized by unique pronunciation patterns. Experimental results reveal that the optimal filtering strategy depends on both model capacity and dataset characteristics. Larger models, like Whisper Large V3, particularly benefit from aggressive filtering, while smaller models may not require such stringent filtering, especially on non-normalized text. This work highlights the importance of adjusting synthetic data augmentation and filtering to specific model architectures and target domains. The proposed method, robust and adaptable, enhances ASR performance across diverse language settings. We have open-sourced the entire work, which includes 140 hours of synthetically generated Portuguese speech, as well as the pipeline and parameter settings used to create these samples. Additionally, we provide the fine-tuned Whisper models and the code required to reproduce this research. Our code will be available at https://github.com/my-north-ai/semantic_audio_filtering.
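The filtering step summarized in the abstract — scoring how well a synthetic audio clip's embedding aligns with the embedding of its transcript, and discarding poorly aligned samples — can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the toy embeddings, and the 0.5 threshold are assumptions for the example; in the paper, the embeddings would come from the trained contrastive model.

```python
# Minimal sketch of semantic filtering via embedding alignment.
# Assumption: audio_embs[i] and text_embs[i] are the contrastive-model
# embeddings of synthetic sample i and its transcript (toy vectors here).
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_semantic(audio_embs, text_embs, threshold=0.5):
    """Keep indices of samples whose audio/text embeddings align well."""
    return [i for i, (a, t) in enumerate(zip(audio_embs, text_embs))
            if cosine_similarity(a, t) >= threshold]

# Toy 4-dimensional embeddings for three synthetic samples.
audio = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]
text = [[1.0, 0.0, 0.0, 0.0],   # well aligned
        [0.0, 0.0, 1.0, 0.0],   # semantically mismatched
        [1.0, 0.0, 0.0, 0.0]]   # partially aligned
print(filter_semantic(audio, text))  # → [0, 2]
```

The paper's key finding is that the right threshold is not universal: larger models such as Whisper Large V3 benefit from aggressive filtering, while smaller models may perform better with looser filtering, particularly on non-normalized text.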
Pages: 155136-155150
Page count: 15