Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes

Cited by: 2
Authors
Mitsui, Kentaro [1]
Koriyama, Tomoki [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
Source
INTERSPEECH 2020
Keywords
deep Gaussian process; statistical speech synthesis; multi-speaker modeling; latent variable model; speaker adaptation
DOI
10.21437/Interspeech.2020-3167
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model. Although many approaches using deep neural networks (DNNs) have been proposed, DNNs are prone to overfitting when the amount of training data is limited. We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian kernel regressions and is therefore robust to overfitting. In this framework, speaker information is fed to the duration/acoustic models using speaker codes. We also examine the use of deep Gaussian process latent variable models (DGPLVMs). In this approach, the representation of each speaker is learned jointly with the other model parameters, so the similarity or dissimilarity of speakers is taken into account efficiently. We experimentally evaluated two situations to investigate the effectiveness of the proposed methods: one in which the amount of data from each speaker is balanced (speaker-balanced), and one in which the data from certain speakers are limited (speaker-imbalanced). Subjective and objective evaluation results showed that both the DGP and the DGPLVM synthesize multi-speaker speech more effectively than a DNN in the speaker-balanced situation. We also found that the DGPLVM significantly outperforms the DGP in the speaker-imbalanced situation.
Pages: 2032-2036 (5 pages)
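The abstract describes conditioning the duration/acoustic models on speaker codes, with each layer of the DGP being a Bayesian kernel regression. Below is a minimal, self-contained sketch of that conditioning idea using a single-layer GP regression in NumPy; it is not the authors' implementation, and all array shapes, hyperparameters, and the toy data are assumptions made purely for illustration. In the DGPLVM variant, the fixed one-hot speaker code would be replaced by a per-speaker latent vector learned jointly with the model.

```python
# Illustrative sketch only (assumed shapes and toy data), not the paper's system:
# one Bayesian kernel regression (a single GP layer) whose inputs are linguistic
# features concatenated with a one-hot speaker code, so one model covers all speakers.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the row vectors of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-2):
    """Standard GP regression posterior mean (one layer of a deep GP stack)."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train)
    return K_star @ np.linalg.solve(K, y_train)

rng = np.random.default_rng(0)
n_speakers, n_frames, n_ling = 3, 60, 8        # toy sizes, chosen arbitrarily

ling = rng.normal(size=(n_frames, n_ling))     # stand-in linguistic features
spk_id = rng.integers(0, n_speakers, size=n_frames)
spk_code = np.eye(n_speakers)[spk_id]          # one-hot speaker codes

# Concatenate the speaker code with the frame-level features (speaker-code approach).
X = np.hstack([ling, spk_code])
y = rng.normal(size=(n_frames, 1))             # stand-in acoustic targets

mu = gp_posterior_mean(X, y, X[:5])            # predict the first 5 frames
print(mu.shape)                                # (5, 1)
```

The sketch makes the speaker identity just another input dimension of the kernel, so frames from acoustically similar speakers can share statistical strength; learning the speaker vector instead of fixing it to a one-hot code is, in spirit, what the DGPLVM adds.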