DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

Cited by: 1
Authors
Liu, Sen [1]
Guo, Yiwei [1]
Du, Chenpeng [1]
Chen, Xie [1]
Yu, Kai [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, AI Inst, X-LANCE Lab, Shanghai, Peoples R China
Source
INTERSPEECH 2023
Keywords
cross-lingual text-to-speech; dual speaker embedding; vector-quantized acoustic feature
DOI
10.21437/Interspeech.2023-363
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain the speaker's timbre (i.e., speaker similarity) and eliminate the accent from the speaker's first language (i.e., nativeness). In this paper, we demonstrate that vector-quantized (VQ) acoustic features contain less speaker information than mel-spectrograms. Based on this finding, we propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style. Here, one embedding is fed to the acoustic model to learn the linguistic speaking style, while the other is integrated into the vocoder to mimic the target speaker's timbre. Experiments show that by combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis, especially in terms of nativeness.
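The routing of the two embeddings described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `make_embedding`, `acoustic_model`, `vocoder`, and `synthesize` are hypothetical names, and the arithmetic inside them is a placeholder for trained networks. The only point shown is structural: the first embedding reaches the acoustic model alone, and the second reaches the vocoder alone, so the intermediate VQ index sequence does not depend on the timbre embedding.

```python
import random


def make_embedding(seed, dim=4):
    """Hypothetical stand-in for a learned speaker embedding."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def acoustic_model(phonemes, style_embedding, codebook_size=8):
    """Toy acoustic model: text + style embedding -> one VQ index per phoneme.

    A real model would be a trained network; this only mimics the interface.
    """
    bias = sum(style_embedding)
    return [int(abs(ord(p[0]) + 10 * bias)) % codebook_size for p in phonemes]


def vocoder(vq_indices, timbre_embedding, samples_per_frame=2):
    """Toy vocoder: VQ indices + timbre embedding -> list of 'samples'."""
    gain = 1.0 + sum(timbre_embedding) / len(timbre_embedding)
    wave = []
    for idx in vq_indices:
        wave.extend([gain * idx] * samples_per_frame)
    return wave


def synthesize(phonemes, style_embedding, timbre_embedding):
    """Route one embedding to each stage, as the abstract describes."""
    vq = acoustic_model(phonemes, style_embedding)  # speaking style only
    wave = vocoder(vq, timbre_embedding)            # timbre only
    return vq, wave


phonemes = ["n", "i", "h", "ao"]
style = make_embedding(seed=1)
vq_a, wave_a = synthesize(phonemes, style, make_embedding(seed=2))
vq_b, wave_b = synthesize(phonemes, style, make_embedding(seed=3))
print(vq_a == vq_b)  # the VQ sequence ignores the timbre embedding
```

In this sketch, swapping the timbre embedding changes only the vocoder stage, which mirrors the separation of speaking style and timbre that the abstract attributes to the two embeddings.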
Pages: 616-620
Page count: 5
Related papers
50 in total
  • [1] Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
    Chen, Mengnan
    Chen, Minchuan
    Liang, Shuang
    Ma, Jun
    Chen, Lei
    Wang, Shaojun
    Xiao, Jing
    [J]. INTERSPEECH 2019, 2019, : 2105 - 2109
  • [2] METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer
    Zhu, Xinfa
    Lei, Yi
    Li, Tao
    Zhang, Yongmao
    Zhou, Hongbin
    Lu, Heng
    Xie, Lei
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1506 - 1518
  • [3] Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    [J]. INTERSPEECH 2021, 2021, : 1614 - 1618
  • [4] Cross-lingual speaker adaptation using domain adaptation and speaker consistency loss for text-to-speech synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    [J]. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021, 5 : 3376 - 3380
  • [5] Cross-lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    [J]. INTERSPEECH 2020, 2020, : 2947 - 2951
  • [6] Language Agnostic Speaker Embedding for Cross-Lingual Personalized Speech Generation
    Zhou, Yi
    Tian, Xiaohai
    Li, Haizhou
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 3427 - 3439
  • [7] LIGHT-TTS: LIGHTWEIGHT MULTI-SPEAKER MULTI-LINGUAL TEXT-TO-SPEECH
    Li, Song
    Ouyang, Beibei
    Li, Lin
    Hong, Qingyang
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8383 - 8387
  • [8] Learning Speaker Embedding from Text-to-Speech
    Cho, Jaejin
    Zelasko, Piotr
    Villalba, Jesus
    Watanabe, Shinji
    Dehak, Najim
    [J]. INTERSPEECH 2020, 2020, : 3256 - 3260
  • [9] Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech
    Zhan, Haoyue
    Yu, Xinyuan
    Zhang, Haitong
    Zhang, Yang
    Lin, Yue
    [J]. INTERSPEECH 2022, 2022, : 4247 - 4251
  • [10] DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech - A Study Between English and Mandarin
    Li, Tao
    Hu, Chenxu
    Cong, Jian
    Zhu, Xinfa
    Li, Jingbei
    Tian, Qiao
    Wang, Yuping
    Xie, Lei
    [J]. IEEE/ACM Transactions on Audio Speech and Language Processing, 2023, 31 : 3418 - 3430