ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

Cited by: 2

Authors:

Kang, Minki [1,2]

Han, Wooseok [1]

Hwang, Sung Ju [2]

Yang, Eunho [1,2]

Affiliations:

[1] AITRICS, Seoul, South Korea

[2] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Source:

INTERSPEECH 2023 | 2023

Keywords:

Text-to-Speech Synthesis; Emotional TTS

DOI:

10.21437/Interspeech.2023-754

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.


Pages: 4339-4343

Page count: 5
