You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

Authors
Laptev, Aleksandr [1]
Korostik, Roman [1]
Svischev, Aleksey [1]
Andrusenko, Andrei [1]
Medennikov, Ivan [1, 2]
Rybin, Sergey [1, 2]
Affiliations
[1] ITMO University, St Petersburg 197101, Russia
[2] STC Innovations Ltd, St Petersburg 194044, Russia
Keywords
Speech Recognition; End-to-End; Speech Synthesis; Data Augmentation; ASR
DOI
10.1109/cisp-bmei51763.2020.9263564
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially for low-resource tasks. Building on recent advances in speech synthesis (text-to-speech, or TTS), we train a TTS system on an ASR training database and then extend that data with synthesized speech to train a recognition model. We argue that, when the amount of training data is relatively small, this approach can allow an end-to-end model to reach the quality of hybrid systems. For an artificial low-to-medium-resource setup, we compare the proposed augmentation with a semi-supervised learning technique. We also investigate how the choice of vocoder influences final ASR performance by comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an external language model, our approach outperforms the semi-supervised setup on LibriSpeech test-clean and is only 33% worse than a comparable supervised setup. Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
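The abstract reports results as word error rate (WER) and describes one setup as "only 33% worse" than another. The sketch below is not the authors' scoring code. It assumes that "worse" means a relative increase in WER, and it uses the conventional WER definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names and example sentences are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance: the minimum number of substitutions,
    deletions and insertions that turn the reference into the hypothesis."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of reference word r
                curr[j - 1] + 1,         # insertion of hypothesis word h
                prev[j - 1] + (r != h),  # substitution (cost 1) or match (cost 0)
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors over total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        errors += word_errors(reference, hypothesis)
        ref_words += len(reference.split())
    if ref_words == 0:
        raise ValueError("reference side contains no words")
    return errors / ref_words


def relative_degradation(wer_system: float, wer_baseline: float) -> float:
    """Relative WER increase of a system over a baseline (0.33 means 33% worse)."""
    return (wer_system - wer_baseline) / wer_baseline


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the")
print(word_errors("the cat sat on the mat", "the cat sit on mat"))  # 2
```

Note that WER is pooled over the whole test set rather than averaged per utterance, so long utterances weigh more. With made-up numbers, `relative_degradation(0.04, 0.03)` is about 0.33, i.e. a 4.0% WER would be "33% worse" than a 3.0% WER in relative terms.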
Pages: 439-444
Number of pages: 6