Using Data Augmentation and Time-Scale Modification to Improve ASR of Children's Speech in Noisy Environments

Cited by: 4
Authors
Kathania, Hemant Kumar [1, 2]
Kadiri, Sudarsana Reddy [1 ]
Alku, Paavo [1 ]
Kurimo, Mikko [1 ]
Affiliations
[1] Aalto Univ, Dept Signal Proc & Acoust, Otakaari 3, FI-00076 Espoo, Finland
[2] Natl Inst Technol Sikkim, Dept Elect & Commun Engn, Ravangla 737139, India
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 18
Funding
Academy of Finland;
Keywords
recognition of children's speech; data augmentation; time-scale modification; DNN; SIGNAL ESTIMATION; RECOGNITION; INFANTS;
DOI
10.3390/app11188420
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Current ASR systems perform poorly in recognizing children's speech in noisy environments because recognizers are typically trained with clean adult speech, so there are two mismatches between the training and testing phases (i.e., clean speech in training vs. noisy speech in testing, and adult speech in training vs. child speech in testing). This article studies methods to tackle the effects of these two mismatches in recognition of noisy children's speech by investigating two techniques: data augmentation and time-scale modification. In the former, clean training data of adult speakers are corrupted with additive noise in order to obtain training data that better correspond to the noisy testing conditions. In the latter, the fundamental frequency (F0) and speaking rate of children's speech are modified in the testing phase in order to reduce differences in the prosodic characteristics between the testing data of child speakers and the training data of adult speakers. A standard ASR system based on DNN-HMM was built, and the effects of data augmentation, F0 modification, and speaking rate modification on word error rate (WER) were evaluated first separately and then by combining all three techniques. The experiments were conducted using children's speech corrupted with additive noise of four different noise types in four different signal-to-noise ratio (SNR) categories. The results show that the combination of all three techniques yielded the best ASR performance. As an example, the WER value averaged over all four noise types in the SNR category of 5 dB dropped from 32.30% to 12.09% when the baseline system, in which no data augmentation or time-scale modification was used, was replaced with a recognizer that was built using a combination of all three techniques.
In summary, in recognizing noisy children's speech with ASR systems trained with clean adult speech, considerable improvements in the recognition performance can be achieved by combining data augmentation based on noise addition in the system training phase and time-scale modification based on modifying F0 and speaking rate of children's speech in the testing phase.
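The noise-addition step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mix_at_snr` and `measured_snr_db`, the synthetic tone and Gaussian noise, and the listed SNR levels other than 5 dB are all assumptions made for illustration. It only shows the standard way of scaling a noise signal so that a clean signal plus noise reaches a chosen SNR.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so the mixture has the requested SNR (dB).

    Hypothetical helper (not from the paper): both inputs are lists of
    float samples; the noise is tiled or truncated to the speech length.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise if it is shorter than the speech, then trim it.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * reps)[: len(speech)]

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")

    # SNR_dB = 10*log10(P_speech / (g^2 * P_noise)), solved for the gain g.
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB of a mixture, given the clean speech it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    p_s = sum(s * s for s in speech)
    p_n = sum(r * r for r in residual)
    return 10.0 * math.log10(p_s / p_n)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins: a 200 Hz tone for "speech", Gaussian noise.
    speech = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(16000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
    # Example levels only; the abstract names just the 5 dB category.
    for snr in (0, 5, 10, 15):
        noisy = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noisy), 2))
```

In an augmentation pipeline of this kind, each clean training utterance would typically be mixed with several noise types at several SNR levels, and the noisy copies added to the training set alongside the clean originals.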
Pages: 18
Related papers
50 in total
  • [1] Speech Time-Scale Modification With GANs
    Cohen, Eyal
    Kreuk, Felix
    Keshet, Joseph
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1067 - 1071
  • [2] Time-scale modification of speech signals
    Ninness, Brett
    Henriksen, Soren John
    IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2008, 56 (04) : 1479 - 1488
  • [3] Spectral Modification Based Data Augmentation for Improving End-to-End ASR for Children's Speech
    Singh, Vishwanath Pratap
    Sailor, Hardik
    Bhattacharya, Supratik
    Pandey, Abhishek
    INTERSPEECH 2022, 2022, : 3213 - 3217
  • [4] Variable time-scale modification of speech using transient information
    Lee, SJ
    Kim, HD
    Kim, HS
    1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS, 1997, : 1319 - 1322
  • [5] Data embedding in audio using time-scale modification
    Mansour, MF
    Tewfik, AH
    IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 2005, 13 (03): : 432 - 440
  • [6] Shape invariant time-scale modification of speech using a harmonic model
    O'Brien, Darragh
    Monaghan, Alex
    ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 1999, 1 : 381 - 384
  • [7] Shape invariant time-scale modification of speech using a harmonic model
    O'Brien, D
    Monaghan, A
    ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, : 381 - 384
  • [8] Time-scale modification of speech signals, for language-learning impaired children
    Erogul, O
    Karagoz, I
    PROCEEDINGS OF THE 1998 2ND INTERNATIONAL CONFERENCE BIOMEDICAL ENGINEERING DAYS, 1998, : 33 - 35
  • [9] A simple hybrid approach to the time-scale modification of speech
    Knox, D
    Bailey, N
    Stewart, I
    JOURNAL OF THE AUDIO ENGINEERING SOCIETY, 2005, 53 (7-8): : 612 - 619
  • [10] A simple hybrid approach to the time-scale modification of speech
    Knox, D. (D.Knox@gcal.ac.uk), 1600, Audio Engineering Society, 60 East 42nd Street, New York, NY 10165-0075, United States (53): : 7 - 8