ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

Cited by: 2
Authors
Kang, Minki [1, 2]
Han, Wooseok [1]
Hwang, Sung Ju [2]
Yang, Eunho [1, 2]
Affiliations
[1] AITRICS, Seoul, South Korea
[2] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Source
Interspeech 2023
Keywords
Text-to-Speech Synthesis; Emotional TTS
DOI
10.21437/Interspeech.2023-754
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.
Pages: 4339-4343
Page count: 5