An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets

Cited by: 2
Authors
Gallegos, Pilar Oplustil [1 ]
Williams, Jennifer [1 ]
Rownicka, Joanna [1 ]
King, Simon [1 ]
Affiliations
[1] University of Edinburgh, Centre for Speech Technology Research, Edinburgh, Scotland, UK
Source
INTERSPEECH 2020
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
speech synthesis; data; clustering; speaker representation; sequence-to-sequence models; multi-speaker;
DOI
10.21437/Interspeech.2020-2567
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully-chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
Pages: 1758-1762 (5 pages)
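
The abstract describes the method only at a high level: cluster per-speaker acoustic representations, then train the TTS model only on speakers from the chosen subset. The following is a loose illustrative sketch, not the authors' implementation; it assumes per-speaker embeddings are already available and uses k-means with a largest-cluster heuristic purely as stand-ins for the paper's actual representations, clustering algorithm, and selection criterion.

import numpy as np
from sklearn.cluster import KMeans


def select_speaker_subset(speaker_embeddings, n_clusters=4, keep_cluster=None):
    """Cluster per-speaker embeddings and return speaker IDs from one cluster.

    speaker_embeddings: dict mapping speaker ID -> fixed-size vector, e.g. the
        mean of utterance-level acoustic embeddings for that speaker.
    keep_cluster: index of the cluster to keep; if None, keep the largest
        cluster (a placeholder heuristic, not the paper's criterion).
    """
    ids = list(speaker_embeddings)
    X = np.stack([speaker_embeddings[s] for s in ids])

    # Unsupervised grouping of speakers in the embedding space.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

    if keep_cluster is None:
        keep_cluster = int(np.bincount(labels).argmax())

    return [s for s, lab in zip(ids, labels) if lab == keep_cluster]


# Usage example with 40 hypothetical speakers and random 256-dim "embeddings".
speakers = {f"spk_{i:03d}": np.random.randn(256) for i in range(40)}
subset = select_speaker_subset(speakers, n_clusters=4)
print(f"Selected {len(subset)} of {len(speakers)} speakers")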