ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS

Cited by: 2

Authors:

Tang, Haobin [1,2]

Zhang, Xulong [1]

Cheng, Ning [1]

Xiao, Jing [1]

Wang, Jianzong [1]

Affiliations:

[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China

[2] Univ Sci & Technol China, Hefei, Peoples R China

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024

Keywords:

emotional speech synthesis; speech emotion diarization; diffusion denoising probabilistic model;

DOI:

10.1109/ICASSP48485.2024.10446467

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
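The conditioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two embeddings are fused, so element-wise addition after broadcasting the utterance-level vector over frames is an assumption. The names combine_emotion_embeddings, toy_noise_predictor, and ddpm_reverse_step are hypothetical, the noise predictor is a stand-in for a trained network, and the reverse step uses the standard DDPM ancestral sampling update with variance beta_t.

```python
import math
import random


def combine_emotion_embeddings(utterance_emb, frame_embs):
    """Broadcast the utterance-level vector over time and add it to each
    frame-level vector (assumed fusion rule, not taken from the paper)."""
    return [[u + f for u, f in zip(utterance_emb, frame)] for frame in frame_embs]


def toy_noise_predictor(x_t, cond, t):
    """Hypothetical stand-in for a trained denoiser. It treats the condition
    as the clean target and returns the difference as the 'predicted noise'.
    The step index t is unused here; a real network would consume it."""
    return [[x - c for x, c in zip(x_row, c_row)] for x_row, c_row in zip(x_t, cond)]


def ddpm_reverse_step(x_t, cond, t, betas, rng):
    """One standard DDPM reverse step x_t -> x_{t-1}, conditioned on cond."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = math.prod(1.0 - b for b in betas[: t + 1])
    eps = toy_noise_predictor(x_t, cond, t)
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    sigma = math.sqrt(beta_t) if t > 0 else 0.0  # no noise on the final step
    return [
        [(x - coef * e) / math.sqrt(alpha_t) + sigma * rng.gauss(0.0, 1.0)
         for x, e in zip(x_row, e_row)]
        for x_row, e_row in zip(x_t, eps)
    ]


rng = random.Random(0)
utterance_emb = [0.5, -0.5]                            # e.g. from an SER model
frame_embs = [[0.25, 0.0], [0.5, 0.25], [0.0, -0.25]]  # e.g. from an SED model
cond = combine_emotion_embeddings(utterance_emb, frame_embs)

betas = [0.02 * (i + 1) for i in range(10)]            # toy linear noise schedule
x = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(3)]  # start from noise
for t in reversed(range(len(betas))):
    x = ddpm_reverse_step(x, cond, t, betas, rng)
```

In this toy setup the condition has the same shape as the sample being denoised only so that the stand-in predictor stays trivial; a real acoustic model would take the condition as an extra network input with its own dimensionality.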


Pages: 12146-12150

Number of pages: 5

