Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?

Cited by: 8
Authors
Cooper, Erica [1]
Lai, Cheng-I [2]
Yasuda, Yusuke [1]
Yamagishi, Junichi [1]
Affiliations
[1] Natl Inst Informat, Tokyo, Japan
[2] MIT, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Source
Keywords
Speaker augmentation; Speech synthesis; dialect identification; channel modeling; transfer learning;
DOI
10.21437/Interspeech.2020-1229
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification codes
100104; 100213;
Abstract
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in speaker similarity. We investigate an orthogonal approach to the current speaker adaptation paradigms, speaker augmentation, by creating artificial speakers and by taking advantage of low-quality data. The base Tacotron2 model is modified to account for the channel and dialect factors inherent in these corpora. In addition, we describe a warm-start training strategy that we adopted for Tacotron2 training. A large-scale listening test is conducted, and a distance metric is adopted to evaluate synthesis of dialects. This is followed by an analysis of synthesis quality, speaker and dialect similarity, and a remark on the effectiveness of our speaker augmentation approach. Audio samples are available online.
Pages: 3979-3983
Page count: 5