Improving Speech Recognition using GAN-based Speech Synthesis and Contrastive Unspoken Text Selection

Cited by: 16
Authors
Chen, Zhehuai [1 ]
Rosenberg, Andrew [1 ]
Zhang, Yu [1 ]
Wang, Gary [1 ,2 ]
Ramabhadran, Bhuvana [1 ]
Moreno, Pedro J. [1 ]
Affiliations
[1] Google, Mountain View, CA 94043 USA
[2] Simon Fraser Univ, Burnaby, BC, Canada
Source
INTERSPEECH 2020
Keywords
Speech Synthesis; Speech Recognition; Generative Adversarial Network; Contrastive Data Selection;
DOI
10.21437/Interspeech.2020-1475
Abstract
Text-to-Speech synthesis (TTS) based data augmentation is a relatively new mechanism for utilizing text-only data to improve automatic speech recognition (ASR) training without parameter or inference architecture changes. However, efforts to train speech recognition systems on synthesized utterances suffer from the limited acoustic diversity of TTS outputs. Additionally, the text-only corpus is typically several orders of magnitude larger than the transcribed speech corpus, which makes speech synthesis of all the text data impractical. In this work, we propose to combine a generative adversarial network (GAN) with multi-style training (MTR) to increase acoustic diversity in the synthesized data. We also present a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text. We demonstrate that our proposed method allows ASR models to learn from synthesis of large-scale unspoken text sources and achieves a 35% relative WER reduction on a voice-search task.
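The contrastive language model-based selection described above can be illustrated with a minimal Moore–Lewis-style sketch: score each candidate sentence by the difference between its log-likelihood under an in-domain LM and under a background LM, and keep the top-scoring sentences. The smoothed unigram LMs, the toy corpora, and the `contrastive_select` helper below are illustrative assumptions, not the paper's actual models or data.

```python
# Hedged sketch of contrastive data selection: rank unspoken text by
# score(s) = log P_in(s) - log P_bg(s), keeping sentences that look
# in-domain relative to a background corpus. Unigram LMs with
# add-alpha smoothing stand in for the real language models.
import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab, alpha=1.0):
    """Add-alpha smoothed unigram log-probability of a sentence."""
    lp = 0.0
    for w in sentence.split():
        lp += math.log((counts[w] + alpha) / (total + alpha * vocab))
    return lp

def contrastive_select(candidates, in_domain, background, top_k=2):
    # Shared vocabulary size for smoothing both models identically.
    vocab = len({w for s in in_domain + background + candidates for w in s.split()})
    in_counts = Counter(w for s in in_domain for w in s.split())
    bg_counts = Counter(w for s in background for w in s.split())
    in_total, bg_total = sum(in_counts.values()), sum(bg_counts.values())
    # Sort candidates by the contrastive score, highest first.
    scored = sorted(
        candidates,
        key=lambda s: unigram_logprob(s, in_counts, in_total, vocab)
                      - unigram_logprob(s, bg_counts, bg_total, vocab),
        reverse=True,
    )
    return scored[:top_k]

# Toy voice-search-flavored example (hypothetical data).
in_domain = ["play some music", "call my mom", "navigate to work"]
background = ["the stock market fell today", "parliament passed the bill"]
candidates = [
    "play the next song",
    "the market rallied after the vote",
    "call a taxi to the airport",
]
selected = contrastive_select(candidates, in_domain, background, top_k=2)
print(selected)
```

The news-like sentence scores poorly under the in-domain model relative to the background model and is filtered out, leaving only the voice-search-like candidates for synthesis.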
Pages: 556-560 (5 pages)