Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment

Cited by: 8
Authors
Liu, Zhaoyu [1 ]
Mak, Brian [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
Source
INTERSPEECH 2020, 2020: 2932 - 2936
Keywords
multi-lingual; multi-speaker; text-to-speech; x-vector; tone/stress embedding;
DOI
10.21437/Interspeech.2020-1464
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
Recent studies in multi-lingual and multi-speaker text-to-speech synthesis have proposed approaches that rely on proprietary corpora of performing artists and require fine-tuning to enroll new voices. To reduce these costs, we investigate a novel approach for generating high-quality speech in multiple languages from speakers enrolled only in their native language. In our proposed system, we introduce tone/stress embeddings, which extend the language embedding to represent tone and stress information. By manipulating the tone/stress embedding input, our system can synthesize speech with either a native or a foreign accent. To support online enrollment of new speakers, we condition the Tacotron-based synthesizer on speaker embeddings derived, via transfer learning, from a pre-trained x-vector speaker encoder. We also introduce a shared phoneme set that encourages more phoneme sharing than the IPA. Our MOS results demonstrate that the synthesized native speech in all languages is highly intelligible and natural. We further find that L2-norm normalization and ZCA-whitening of the x-vectors help improve system stability and audio quality, and that WaveNet performance is seemingly language-independent: a WaveNet model trained on any one of the three supported languages can generate speech in the other two languages very well.
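A minimal sketch of the x-vector post-processing mentioned in the abstract, i.e. ZCA-whitening and L2-norm normalization of the speaker embeddings before they condition the Tacotron-based synthesizer. This is an illustrative NumPy example, not the authors' implementation; the function names, the example dimensions (512-dimensional x-vectors), and the whiten-then-normalize order are assumptions.

import numpy as np

def zca_whiten(x_vectors: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Decorrelate the x-vectors with a ZCA (Mahalanobis) whitening transform.
    mean = x_vectors.mean(axis=0, keepdims=True)
    centered = x_vectors - mean
    cov = centered.T @ centered / (len(x_vectors) - 1)
    # Eigendecomposition of the covariance; eps guards against near-zero eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ whitening

def l2_normalize(x_vectors: np.ndarray) -> np.ndarray:
    # Scale each x-vector to unit L2 norm.
    norms = np.linalg.norm(x_vectors, axis=1, keepdims=True)
    return x_vectors / np.maximum(norms, 1e-12)

# Illustrative shapes: 512-dimensional x-vectors for 1000 enrollment utterances.
raw = np.random.randn(1000, 512).astype(np.float32)
speaker_embeddings = l2_normalize(zca_whiten(raw))  # conditioning input to the synthesizer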
Pages: 2932 - 2936
Page count: 5
Related papers
50 records in total
  • [1] Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech
    Singh, Abhayjeet
    Nagireddi, Amala
    Jayakumar, Anjali
    Deekshitha, G.
    Bandekar, Jesuraja
    Roopa, R.
    Badiger, Sandhya
    Udupa, Sathvik
    Kumar, Saurabh
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Zen, Heiga
    Kumar, Pranaw
    Kant, Kamal
    Bole, Amol
    Singh, Bira Chandra
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2024, 5 : 790 - 798
  • [2] LIGHT-TTS: LIGHTWEIGHT MULTI-SPEAKER MULTI-LINGUAL TEXT-TO-SPEECH
    Li, Song
    Ouyang, Beibei
    Li, Lin
    Hong, Qingyang
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8383 - 8387
  • [3] A Controllable Multi-Lingual Multi-Speaker Multi-Style Text-to-Speech Synthesis With Multivariate Information Minimization
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Minchan
    Lee, Hyeonseung
    Kim, Nam Soo
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 55 - 59
  • [4] Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
    Chen, Mengnan
    Chen, Minchuan
    Liang, Shuang
    Ma, Jun
    Chen, Lei
    Wang, Shaojun
    Xiao, Jing
    INTERSPEECH 2019, 2019, : 2105 - 2109
  • [5] LIMMITS'24: Multi-Speaker, Multi-Lingual INDIC TTS With Voice Cloning
    Udupa, Sathvik
    Bandekar, Jesuraja
    Singh, Abhayjeet
    Deekshitha, G.
    Kumar, Saurabh
    Badiger, Sandhya
    Nagireddi, Amala
    Roopa, R.
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Kumar, Pranaw
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2025, 6 : 293 - 302
  • [6] Deep Voice 2: Multi-Speaker Neural Text-to-Speech
    Arik, Sercan O.
    Diamos, Gregory
    Gibiansky, Andrew
    Miller, John
    Peng, Kainan
    Ping, Wei
    Raiman, Jonathan
    Zhou, Yanqi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [7] Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech
    Jeong, Myeonghun
    Kim, Minchan
    Choi, Byoung Jin
    Yoon, Jaesam
    Jang, Won
    Kim, Nam Soo
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1519 - 1530
  • [8] Multi-Speaker Text-to-Speech Training With Speaker Anonymized Data
    Huang, Wen-Chin
    Wu, Yi-Chiao
    Toda, Tomoki
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 2995 - 2999
  • [9] Multi-speaker Emotional Text-to-speech Synthesizer
    Cho, Sungjae
    Lee, Soo-Young
    INTERSPEECH 2021, 2021, : 2337 - 2338
  • [10] LIMMITS'24: MULTI-SPEAKER, MULTI-LINGUAL INDIC TTS WITH VOICE CLONING
    Singh, Abhayjeet
    Nagireddi, Amala
    Deekshitha, G.
    Bandekar, Jesuraja
    Roopa, R.
    Badiger, Sandhya
    Udupa, Sathvik
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Kumar, Pranaw
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 61 - 62