Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding

Cited by: 21
Authors
Chen, Mengnan [1]
Chen, Minchuan [2]
Liang, Shuang [2]
Ma, Jun [2]
Chen, Lei [1]
Wang, Shaojun [2]
Xiao, Jing [2]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Ping An Technol, Shenzhen, Guangdong, Peoples R China
Source
Keywords
neural TTS; multi-speaker modeling; multilanguage; speaker embedding
DOI
10.21437/Interspeech.2019-1632
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Neural network-based models for text-to-speech (TTS) synthesis have made significant progress in recent years. In this paper, we present a cross-lingual, multi-speaker neural end-to-end TTS framework that can model speaker characteristics and synthesize speech in different languages. We implement the model by introducing a separately trained neural speaker embedding network, which can represent the latent structure of different speakers and language pronunciations. We train the speech synthesis network bilingually and demonstrate that it is possible to synthesize a Chinese speaker's English speech and vice versa. We explore different methods of fitting a new speaker using only a few speech samples. The experimental results show that, with only several minutes of audio from a new speaker, the proposed model can synthesize speech in both languages and achieve decent naturalness and similarity in each.
Pages: 2105-2109
Number of pages: 5