ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

Cited by: 2

Authors:

Kang, Minki [1,2]

Han, Wooseok [1]

Hwang, Sung Ju [2]

Yang, Eunho [1,2]

Affiliations:

[1] AITRICS, Seoul, South Korea

[2] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Source:

INTERSPEECH 2023 | 2023

Keywords:

Text-to-Speech Synthesis; Emotional TTS

DOI:

10.21437/Interspeech.2023-754

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.

Pages: 4339-4343

Page count: 5
