Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios

Cited by: 3
Authors
Tsunoo, Emiru [1]
Shibata, Kentaro [1]
Narisetty, Chaitanya [2]
Kashiwagi, Yosuke [1]
Watanabe, Shinji [2]
Affiliations
[1] Sony Corp, Tokyo, Japan
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
INTERSPEECH 2021
Keywords
speech recognition; data augmentation; RNN-transducer; text-to-speech; Cycle-GAN; label smoothing;
DOI
10.21437/Interspeech.2021-958
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
Although end-to-end automatic speech recognition (E2E ASR) has achieved strong performance in tasks with abundant paired data, making E2E ASR robust against noisy and low-resource conditions remains challenging. In this study, we investigate data augmentation methods for E2E ASR in distant-talk scenarios. E2E ASR models are trained on the series of CHiME challenge datasets, which are suitable for studying robustness against noisy and spontaneous speech. We propose three augmentation methods and their combinations: 1) data augmentation using text-to-speech (TTS) data, 2) cycle-consistent generative adversarial network (Cycle-GAN) augmentation trained to map between two different audio characteristics, those of clean speech and those of noisy recordings, to match the testing condition, and 3) pseudo-label augmentation provided by a pretrained ASR module for smoothing label distributions. Experimental results on the CHiME-6/CHiME-4 datasets show that each augmentation method individually improves accuracy on top of conventional SpecAugment; further improvements are obtained by combining these approaches. Combining all three augmentations on the CHiME-6 task achieved a 4.3% word error rate (WER) reduction, which was more significant than that of SpecAugment.
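The WER quoted in the abstract is the standard word-level edit-distance metric: substitutions, deletions, and insertions divided by the number of reference words. The following is a minimal sketch of that metric only, not the authors' scoring pipeline; the function name `word_error_rate` is hypothetical, and challenge evaluations typically apply text normalization and dedicated scoring tools that are omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference. The abstract does not say whether the 4.3% reduction is absolute or relative, so the sketch makes no assumption about that.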
Pages: 301-305
Number of pages: 5