CAN WE USE COMMON VOICE TO TRAIN A MULTI-SPEAKER TTS SYSTEM?

Cited by: 1
Authors
Ogun, Sewade [1 ]
Colotte, Vincent [1 ]
Vincent, Emmanuel [1 ]
Affiliations
[1] Univ Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Keywords
Multi-speaker text-to-speech; Common Voice; crowdsourced corpus; non-intrusive quality estimation;
DOI
10.1109/SLT54892.2023.10022766
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged the availability of large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WVMOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS points with respect to training on all the samples and by 0.35 MOS points with respect to training on the LibriTTS dataset. This opens the door to automatic TTS dataset curation for a wider range of languages.
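The curation approach described in the abstract amounts to scoring each crowdsourced clip with a non-intrusive MOS estimator and keeping only clips above a quality threshold for TTS training. Below is a minimal, hypothetical Python sketch of that filtering step, not the authors' code: `estimate_mos` is a placeholder for the WVMOS predictor, and the 4.0 threshold, file paths, and metadata column names are illustrative assumptions rather than values taken from the paper.

```python
import csv
from pathlib import Path

# Placeholder for the WVMOS predictor: any callable mapping a clip path to an
# estimated MOS on a 1-5 scale. Swap in an actual non-intrusive MOS model here.
def estimate_mos(clip_path: Path) -> float:
    raise NotImplementedError("plug in a non-intrusive MOS estimator such as WVMOS")

def select_high_quality(metadata_tsv: Path, clips_dir: Path,
                        mos_threshold: float = 4.0) -> list[dict]:
    """Keep only samples whose estimated MOS meets or exceeds the threshold."""
    selected = []
    with open(metadata_tsv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            clip = clips_dir / row["path"]  # Common Voice lists clip filenames in a 'path' column
            score = estimate_mos(clip)
            if score >= mos_threshold:
                row["estimated_mos"] = score
                selected.append(row)
    return selected

# Example usage (hypothetical file names): build a filtered training list
# for a multi-speaker TTS model from a Common Voice release.
# train_rows = select_high_quality(Path("validated.tsv"), Path("clips"), mos_threshold=4.0)
```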
Pages: 900-905
Number of pages: 6