Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding

Cited by: 21
Authors
Chen, Mengnan [1 ]
Chen, Minchuan [2 ]
Liang, Shuang [2 ]
Ma, Jun [2 ]
Chen, Lei [1 ]
Wang, Shaojun [2 ]
Xiao, Jing [2 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Ping An Technol, Shenzhen, Guangdong, Peoples R China
Source
INTERSPEECH 2019
Keywords
neural TTS; multi-speaker modeling; multilanguage; speaker embedding
DOI
10.21437/Interspeech.2019-1632
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Codes
100104; 100213
Abstract
Neural network-based models for text-to-speech (TTS) synthesis have made significant progress in recent years. In this paper, we present a cross-lingual, multi-speaker neural end-to-end TTS framework that can model speaker characteristics and synthesize speech in different languages. We implement the model by introducing a separately trained neural speaker embedding network, which can represent the latent structure of different speakers and language pronunciations. We train the speech synthesis network bilingually and demonstrate the feasibility of synthesizing a Chinese speaker's English speech and vice versa. We also explore different methods for fitting a new speaker using only a few speech samples. The experimental results show that, with only several minutes of audio from a new speaker, the proposed model can synthesize speech bilingually and achieve decent naturalness and similarity in both languages.
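The abstract describes conditioning an end-to-end synthesis network on an embedding produced by a separately trained speaker encoder. The sketch below illustrates that conditioning pattern in PyTorch; it is not the authors' implementation, and the layer choices (LSTM/GRU), dimensions, shared token table, and the broadcast-and-concatenate strategy are assumptions made for illustration only.

# Minimal PyTorch sketch (not the paper's code) of a separately trained speaker
# encoder whose embedding conditions a toy text-to-mel synthesis network.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Stand-in for the neural speaker embedding network: maps reference mel
    frames from a target speaker to a fixed-size, L2-normalized embedding."""
    def __init__(self, n_mels=80, hidden=256, emb_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, ref_mels):                      # (B, T_ref, n_mels)
        _, (h, _) = self.lstm(ref_mels)
        return nn.functional.normalize(self.proj(h[-1]), dim=-1)  # (B, emb_dim)

class MultiSpeakerTTS(nn.Module):
    """Toy bilingual text-to-mel model: a shared token encoder whose outputs are
    concatenated with the broadcast speaker embedding before decoding."""
    def __init__(self, vocab=512, txt_dim=256, emb_dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, txt_dim)     # shared Chinese/English token table (assumption)
        self.encoder = nn.GRU(txt_dim, txt_dim, batch_first=True)
        self.decoder = nn.GRU(txt_dim + emb_dim, txt_dim, batch_first=True)
        self.mel_out = nn.Linear(txt_dim, n_mels)

    def forward(self, tokens, spk_emb):               # (B, T_txt), (B, emb_dim)
        enc, _ = self.encoder(self.embed(tokens))     # (B, T_txt, txt_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, enc.size(1), -1)
        dec, _ = self.decoder(torch.cat([enc, spk], dim=-1))
        return self.mel_out(dec)                      # predicted mel frames

# Usage: embed a short reference clip from a new speaker, then synthesize with it.
spk_encoder, tts = SpeakerEncoder(), MultiSpeakerTTS()
ref_mels = torch.randn(1, 200, 80)                    # reference mels from the new speaker
tokens = torch.randint(0, 512, (1, 50))               # token IDs in either language
print(tts(tokens, spk_encoder(ref_mels)).shape)       # torch.Size([1, 50, 80])

Under this sketch, fitting a new speaker from a few samples corresponds to computing (or fine-tuning) that speaker's embedding rather than retraining the whole synthesis network, which is one plausible reading of the adaptation methods the abstract mentions.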
Pages: 2105-2109
Page count: 5
Related Papers
50 records in total
  • [1] DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
    Liu, Sen
    Guo, Yiwei
    Du, Chenpeng
    Chen, Xie
    Yu, Kai
    INTERSPEECH 2023, 2023: 616-620
  • [2] Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    INTERSPEECH 2021, 2021: 1614-1618
  • [3] Cross-lingual speaker adaptation using domain adaptation and speaker consistency loss for text-to-speech synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021, 5: 3376-3380
  • [4] Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
    Liu, Zhaoyu
    Mak, Brian
    INTERSPEECH 2020, 2020: 2932-2936
  • [5] Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech
    Singh, Abhayjeet
    Nagireddi, Amala
    Jayakumar, Anjali
    Deekshitha, G.
    Bandekar, Jesuraja
    Roopa, R.
    Badiger, Sandhya
    Udupa, Sathvik
    Kumar, Saurabh
    Ghosh, Prasanta Kumar
    Murthy, Hema A.
    Zen, Heiga
    Kumar, Pranaw
    Kant, Kamal
    Bole, Amol
    Singh, Bira Chandra
    Tokuda, Keiichi
    Hasegawa-Johnson, Mark
    Olbrich, Philipp
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2024, 5: 790-798
  • [6] Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
    Luong, Hieu-Thi
    Wang, Xin
    Yamagishi, Junichi
    Nishizawa, Nobuyuki
    INTERSPEECH 2019, 2019: 1303-1307
  • [7] Cross-lingual multi-speaker speech synthesis with limited bilingual training data
    Cai, Zexin
    Yang, Yaogen
    Li, Ming
    COMPUTER SPEECH AND LANGUAGE, 2023, 77
  • [8] Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes
    Mitsui, Kentaro
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    INTERSPEECH 2020, 2020: 2032-2036
  • [9] Deep Voice 2: Multi-Speaker Neural Text-to-Speech
    Arik, Sercan O.
    Diamos, Gregory
    Gibiansky, Andrew
    Miller, John
    Peng, Kainan
    Ping, Wei
    Raiman, Jonathan
    Zhou, Yanqi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [10] Multi-speaker Emotional Text-to-speech Synthesizer
    Cho, Sungjae
    Lee, Soo-Young
    INTERSPEECH 2021, 2021: 2337-2338