ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

Cited by: 2
Authors
Kang, Minki [1 ,2 ]
Han, Wooseok [1 ]
Hwang, Sung Ju [2 ]
Yang, Eunho [1 ,2 ]
Affiliations
[1] AITRICS, Seoul, South Korea
[2] Korea Adv Inst Sci & Technol, Daejeon, South Korea
Source
Keywords
Text-to-Speech Synthesis; Emotional TTS;
DOI
10.21437/Interspeech.2023-754
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.
Pages: 4339 - 4343
Number of pages: 5