An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets

Cited by: 2

Authors:

Gallegos, Pilar Oplustil [1]

Williams, Jennifer [1]

Rownicka, Joanna [1]

King, Simon [1]

Affiliations:

[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland

Source:

INTERSPEECH 2020 | 2020

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC);

Keywords:

speech synthesis; data; clustering; speaker representation; sequence-to-sequence models; multi-speaker;

DOI:

10.21437/Interspeech.2020-2567

Chinese Library Classification (CLC) code:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully-chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
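The abstract does not state which acoustic representation or clustering algorithm the authors use, so the following is only a minimal sketch of the general idea of clustering per-speaker representations and keeping one cluster as the training subset. It assumes that utterance-level feature vectors are already available, averages them per speaker, and groups the speakers with a plain k-means using farthest-point initialisation (a choice made here for determinism, not taken from the paper). The function names `speaker_means`, `kmeans` and `select_subset`, the synthetic data, and the rule for picking a cluster are all hypothetical.

```python
import numpy as np

def speaker_means(utt_vectors, utt_speakers):
    """Average utterance-level vectors into one vector per speaker."""
    speakers = sorted(set(utt_speakers))
    labels = np.asarray(utt_speakers)
    means = np.stack([utt_vectors[labels == s].mean(axis=0) for s in speakers])
    return speakers, means

def kmeans(x, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [x[0]]
    while len(centers) < k:
        stacked = np.stack(centers)
        # Distance from every point to its nearest already-chosen center.
        d = np.linalg.norm(x[:, None, :] - stacked[None, :, :], axis=2).min(axis=1)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.stack([
            x[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers

def select_subset(speakers, assign, cluster_id):
    """Return the speakers that fall into the chosen cluster."""
    return [s for s, a in zip(speakers, assign) if a == cluster_id]

# Synthetic stand-in data (assumption): 6 speakers, 4 utterances each,
# 8-dimensional features, with two well-separated groups of speakers.
rng = np.random.default_rng(0)
utt_speakers = [f"spk{i}" for i in range(6) for _ in range(4)]
offsets = np.array([0.0 if int(s[3:]) < 3 else 5.0 for s in utt_speakers])
utt_vectors = offsets[:, None] + 0.1 * rng.standard_normal((len(utt_speakers), 8))

speakers, means = speaker_means(utt_vectors, utt_speakers)
assign, centers = kmeans(means, k=2)
subset = select_subset(speakers, assign, cluster_id=0)
print(subset)
```

In practice the cluster to keep would have to be chosen by some external criterion, such as listening to samples or an objective quality measure; here cluster 0 is picked purely for illustration.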


Pages: 1758-1762

Page count: 5
