Improving Speech Recognition using GAN-based Speech Synthesis and Contrastive Unspoken Text Selection

Cited by: 16
Authors
Chen, Zhehuai [1]
Rosenberg, Andrew [1]
Zhang, Yu [1]
Wang, Gary [1,2]
Ramabhadran, Bhuvana [1]
Moreno, Pedro J. [1]
Affiliations
[1] Google, Mountain View, CA 94043 USA
[2] Simon Fraser Univ, Burnaby, BC, Canada
Source
Keywords
Speech Synthesis; Speech Recognition; Generative Adversarial Network; Contrastive Data Selection
DOI
10.21437/Interspeech.2020-1475
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Text-to-speech synthesis (TTS) based data augmentation is a relatively new mechanism for utilizing text-only data to improve automatic speech recognition (ASR) training without parameter or inference architecture changes. However, efforts to train speech recognition systems on synthesized utterances suffer from the limited acoustic diversity of TTS outputs. Additionally, the text-only corpus is always several orders of magnitude larger than the transcribed speech corpus, which makes speech synthesis of all the text data impractical. In this work, we propose to combine a generative adversarial network (GAN) with multi-style training (MTR) to increase acoustic diversity in the synthesized data. We also present a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text. We demonstrate that our proposed method allows ASR models to learn from synthesis of large-scale unspoken text sources and achieves a 35% relative WER reduction on a voice-search task.
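The abstract does not spell out the contrastive selection procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a common formulation in which each candidate sentence is scored by the difference between its average log-probability under a target-domain language model and under a background language model, and the top-scoring sentences are kept for synthesis. The add-alpha unigram models, the function names (`train_unigram`, `avg_logprob`, `contrastive_select`), and the toy sentences are all illustrative assumptions.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab, alpha=1.0):
    """Add-alpha smoothed unigram LM, returned as word -> log-probability.

    A stand-in for whatever language model is actually used (assumption).
    """
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}


def avg_logprob(sentence, lm):
    """Length-normalized log-probability of a sentence under a unigram LM."""
    words = sentence.split()
    return sum(lm[w] for w in words) / max(len(words), 1)


def contrastive_select(candidates, target_sents, background_sents, k):
    """Keep the k candidates that look most target-like relative to background."""
    # Shared vocabulary so every word has a smoothed probability in both LMs.
    vocab = {
        w
        for s in list(candidates) + list(target_sents) + list(background_sents)
        for w in s.split()
    }
    target_lm = train_unigram(target_sents, vocab)
    background_lm = train_unigram(background_sents, vocab)

    def score(s):
        # Contrastive score: positive when the target LM prefers the sentence.
        return avg_logprob(s, target_lm) - avg_logprob(s, background_lm)

    return sorted(candidates, key=score, reverse=True)[:k]


# Toy data (hypothetical): target resembles voice-search queries.
target = ["play some music", "call mom now"]
background = ["the committee approved the budget",
              "the report was published today"]
candidates = ["play music now", "the budget report"]
selected = contrastive_select(candidates, target, background, k=1)
```

In a setting like the one the abstract describes, only the selected subset of the unspoken text would be passed to the TTS system, which keeps synthesis cost bounded when the text corpus is far larger than the transcribed speech.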
Pages: 556-560
Number of pages: 5