METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Cited by: 4
Authors
Zhu, Xinfa [1 ]
Lei, Yi [1 ]
Li, Tao [1 ]
Zhang, Yongmao [1 ]
Zhou, Hongbin [2 ]
Lu, Heng [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytechnical University, School of Computer Science, Audio, Speech and Language Processing Group (ASLP@NPU), Xi'an 710072, China
[2] Ximalaya Inc., Shanghai 201203, China
Keywords
Cross-lingual; disentanglement; emotion transfer; speech synthesis; recognition; prosody
DOI
10.1109/TASLP.2024.3363444
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of speech due to the challenges of cross-speaker, cross-lingual emotion transfer: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotion expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and introduces the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal for better disentanglement of speaker timbre in the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of these designs.
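The abstract's third design, a vector-quantization-based emotion matcher for reference selection, can be illustrated with a minimal sketch. This is not the paper's implementation: the codebook, embedding dimension, and the two helper functions below are illustrative assumptions, showing only the general idea of quantizing emotion embeddings against a codebook and picking a reference utterance whose quantized code matches the target.

```python
import numpy as np

def nearest_code(codebook, x):
    # Index of the codebook entry closest to embedding x (Euclidean distance).
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def select_reference(codebook, ref_embeddings, target_embedding):
    """Pick the reference utterance whose quantized emotion code matches the
    target's code; among ties, prefer the closest embedding. If no reference
    shares the code, fall back to the globally closest embedding."""
    target_code = nearest_code(codebook, target_embedding)
    candidates = [i for i, e in enumerate(ref_embeddings)
                  if nearest_code(codebook, e) == target_code]
    if candidates:
        return min(candidates,
                   key=lambda i: np.linalg.norm(ref_embeddings[i] - target_embedding))
    return int(np.argmin(np.linalg.norm(ref_embeddings - target_embedding, axis=1)))
```

Quantizing both the target and the candidate references through the same codebook is what makes the selection robust: references are grouped by discrete emotion code rather than compared purely by raw embedding distance, which tends to preserve emotion diversity across the codebook entries.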
Pages: 1506-1518 (13 pages)
Related Papers
50 items in total (items [31]-[40] shown below)
  • [31] Guo, Houjian; Liu, Chaoran; Ishi, Carlos Toshinori; Ishiguro, Hiroshi. "X-E-Speech: Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice Conversion." INTERSPEECH 2024, 2024, pp. 4983-4987.
  • [32] Baklouti, Imen; Ben Ahmed, Olfa; Baklouti, Raoudha; Fernandez, Christine. "Cross-Lingual Transfert Learning for Speech Emotion Recognition." 2024 IEEE 7th International Conference on Advanced Technologies, Signal and Image Processing (ATSIP 2024), 2024, pp. 559-563.
  • [33] Terashima, Ryo; Yamamoto, Ryuichi; Song, Eunwoo; Shirahata, Yuma; Yoon, Hyun-Wook; Kim, Jae-Min; Tachibana, Kentaro. "Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation." INTERSPEECH 2022, 2022, pp. 3018-3022.
  • [34] Patil, Hemant A.; Sitaram, Sunayana; Sharma, Esha. "DA-IICT Cross-lingual and Multilingual Corpora for Speaker Recognition." ICAPR 2009: Seventh International Conference on Advances in Pattern Recognition, 2009, pp. 187-190.
  • [35] Habib, Hafsa; Tauseef, Huma; Fahiem, Muhammad Abuzar; Farhan, Saima; Usman, Ghousia. "SpeakerNet for Cross-lingual Text-Independent Speaker Verification." Archives of Acoustics, 2020, 45(4), pp. 573-583.
  • [36] Yoon, HyoJeon; Dinh Tuyen Hoang; Ngoc Thanh Nguyen; Hwang, Dosam. "Cross-Lingual Korean Speech-to-Text Summarization." Intelligent Information and Database Systems (ACIIDS 2019), Part I, 2019, vol. 11431, pp. 198-206.
  • [37] Gonzalvo, Xavi; Podsiadlo, Monika. "Text-To-Speech with Cross-lingual Neural Network-based Grapheme-to-Phoneme Models." 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), 2014, pp. 765-769.
  • [38] Wester, Mirjam; Liang, Hui. "Cross-Lingual Speaker Discrimination Using Natural and Synthetic Speech." 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), 2011, pp. 2492-2495.
  • [39] Li, Tao; Wang, Xinsheng; Xie, Qicong; Wang, Zhichao; Jiang, Mingqi; Xie, Lei. "Cross-speaker Emotion Transfer Based on Prosody Compensation for End-to-End Speech Synthesis." INTERSPEECH 2022, 2022, pp. 5498-5502.
  • [40] Shin, Yookyung; Lee, Younggun; Jo, Suhee; Hwang, Yeongtae; Kim, Taesu. "Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS." INTERSPEECH 2022, 2022, pp. 2313-2317.