CAN WE USE COMMON VOICE TO TRAIN A MULTI-SPEAKER TTS SYSTEM?

Cited by: 1
Authors
Ogun, Sewade [1 ]
Colotte, Vincent [1 ]
Vincent, Emmanuel [1 ]
Affiliations
[1] Univ Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Keywords
Multi-speaker text-to-speech; Common Voice; crowdsourced corpus; non-intrusive quality estimation;
DOI
10.1109/SLT54892.2023.10022766
CLC classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged the availability of large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WVMOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS points with respect to training on all the samples and by 0.35 MOS points with respect to training on the LibriTTS dataset. This opens the door to automatic TTS dataset curation for a wider range of languages.
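The sample-selection step described in the abstract can be illustrated with a short script. The sketch below is not the authors' code: it assumes the open-source `wvmos` package (one public implementation of a WV-MOS-style non-intrusive MOS predictor) and a standard Common Voice `validated.tsv` manifest; the file paths and the 4.0 threshold are purely illustrative.

```python
# Minimal sketch of MOS-based sample selection for TTS training data.
# Assumes the open-source `wvmos` package and a Common Voice layout with
# a validated.tsv manifest; threshold and paths are illustrative only.
import csv
from pathlib import Path

from wvmos import get_wvmos  # wav2vec2-based non-intrusive MOS estimator

CLIPS_DIR = Path("cv-corpus/en/clips")        # Common Voice audio clips (example path)
MANIFEST = Path("cv-corpus/en/validated.tsv") # Common Voice manifest (example path)
THRESHOLD = 4.0                               # keep clips predicted above this MOS

model = get_wvmos(cuda=True)  # load the pretrained MOS predictor

kept = []
with MANIFEST.open(newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        clip = CLIPS_DIR / row["path"]
        mos = model.calculate_one(str(clip))  # predicted MOS for one utterance
        if mos >= THRESHOLD:
            kept.append((row["client_id"], row["path"], row["sentence"], f"{mos:.2f}"))

# Write the filtered subset as a new manifest for multi-speaker TTS training.
with open("curated_train.tsv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["speaker", "path", "text", "predicted_mos"])
    writer.writerows(kept)
```

In practice the threshold trades off data quantity against quality, and per-speaker minimum-utterance constraints may also be applied before training the TTS model.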
Pages: 900-905
Number of pages: 6