CAN WE USE COMMON VOICE TO TRAIN A MULTI-SPEAKER TTS SYSTEM?

Cited by: 1
Authors
Ogun, Sewade [1 ]
Colotte, Vincent [1 ]
Vincent, Emmanuel [1 ]
Affiliations
[1] Univ Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Keywords
Multi-speaker text-to-speech; Common Voice; crowdsourced corpus; non-intrusive quality estimation;
DOI
10.1109/SLT54892.2023.10022766
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged the availability of large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WVMOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS points with respect to training on all the samples and by 0.35 MOS points with respect to training on the LibriTTS dataset. This opens the door to automatic TTS dataset curation for a wider range of languages.
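The sample-selection step described in the abstract can be sketched as a simple threshold filter over estimated MOS values. This is a minimal illustration, not the paper's implementation: `predict_mos` is a placeholder for a non-intrusive estimator such as WVMOS, and the function name, signature, and threshold value here are all assumptions for demonstration.

```python
def filter_by_mos(samples, predict_mos, threshold=4.0):
    """Keep only utterances whose estimated MOS meets the threshold.

    samples: iterable of utterance identifiers or audio paths
    predict_mos: callable mapping a sample to a float MOS estimate
                 (placeholder for a real estimator such as WVMOS)
    threshold: illustrative cutoff; the useful value depends on the
               estimator and the target corpus quality
    """
    return [s for s in samples if predict_mos(s) >= threshold]


# Toy usage with a stand-in estimator; real use would score audio files.
mos_table = {"a.wav": 4.3, "b.wav": 2.1, "c.wav": 4.0}
kept = filter_by_mos(mos_table, mos_table.get)  # keeps "a.wav" and "c.wav"
```

In practice the kept subset would then serve as the TTS training set, trading corpus size for per-utterance quality.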
Pages: 900-905 (6 pages)