ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS

Cited by: 2

Authors:

Tang, Haobin [1,2]

Zhang, Xulong [1]

Cheng, Ning [1]

Xiao, Jing [1]

Wang, Jianzong [1]

Affiliations:

[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China

[2] Univ Sci & Technol China, Hefei, Peoples R China

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024

Keywords:

emotional speech synthesis; speech emotion diarization; diffusion denoising probabilistic model;

DOI:

10.1109/ICASSP48485.2024.10446467

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embeddings obtained from SED. These embeddings are used to condition the reverse process of a denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
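The multi-scale conditioning described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fuse_emotion_conditions` and `soft_labels`, the concatenation-based fusion, and all array shapes are assumptions made for illustration only. The sketch broadcasts one utterance-level embedding over every frame, joins it with frame-level embeddings, and turns frame-wise classifier scores into soft labels.

```python
import numpy as np


def fuse_emotion_conditions(utt_emb, frame_emb):
    """Combine one utterance-level embedding with per-frame embeddings.

    Hypothetical fusion by concatenation; the paper may fuse differently.
    utt_emb: shape (d_utt,), frame_emb: shape (T, d_frame).
    Returns an array of shape (T, d_utt + d_frame).
    """
    utt_emb = np.asarray(utt_emb, dtype=np.float32)
    frame_emb = np.asarray(frame_emb, dtype=np.float32)
    if utt_emb.ndim != 1 or frame_emb.ndim != 2:
        raise ValueError("expected utt_emb of rank 1 and frame_emb of rank 2")
    num_frames = frame_emb.shape[0]
    # Repeat the global embedding once per frame so both scales align in time.
    tiled = np.tile(utt_emb, (num_frames, 1))
    return np.concatenate([tiled, frame_emb], axis=1)


def soft_labels(frame_logits):
    """Frame-wise softmax: per-frame emotion probabilities (soft labels).

    frame_logits: shape (T, num_emotions), e.g. scores from a diarization model.
    """
    frame_logits = np.asarray(frame_logits, dtype=np.float32)
    # Subtract the row maximum for numerical stability.
    shifted = frame_logits - frame_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)
```

In a full system, the fused array would serve as the conditioning input to the denoiser at each reverse diffusion step; that network is outside the scope of this sketch.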


Pages: 12146-12150

Page count: 5

50 records in total

[21] Learning multi-scale features for speech emotion recognition with connection attention mechanism
Chen, Zengzhao
Li, Jiawen
Liu, Hai
Wang, Xuyang
Wang, Hu
Zheng, Qiuyu
EXPERT SYSTEMS WITH APPLICATIONS, 2023, 214
[22] Multi-Source and Multi-Representation Adaptation for Cross-Domain Electroencephalography Emotion Recognition
Cao, Jiangsheng
He, Xueqin
Yang, Chenhui
Chen, Sifang
Li, Zhangyu
Wang, Zhanxiang
FRONTIERS IN PSYCHOLOGY, 2022, 12
[23] Real-Time End-to-End Speech Emotion Recognition with Cross-Domain Adaptation
Wongpatikaseree, Konlakorn
Singkul, Sattaya
Hnoohom, Narit
Yuenyong, Sumeth
BIG DATA AND COGNITIVE COMPUTING, 2022, 6 (03)
[24] A Multi-scale Feature Adaptation ConvNeXt for Cross-Domain Fault Diagnosis
Huang, Zhe
Lan, Qing
Li, Mingxuan
Wen, Zhihui
He, Wangpeng
NEURAL COMPUTING FOR ADVANCED APPLICATIONS, NCAA 2024, PT III, 2025, 2183 : 339 - 353
[25] Multi-Scale Adversarial Cross-Domain Detection with Robust Discriminative Learning
Pan, YoungSun
Ma, Andy J.
Gao, Yuan
Wang, JinPeng
Lin, Yiqi
2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1313 - 1321
[26] A Cross-Domain Exploration of Audio and Textual Data for Multi-Modal Emotion Detection
Haque, Mohd Ariful
George, Roy
Rifat, Rakib Hossain
Uddin, Md Shihab
Kamal, Marufa
Gupta, Kishor Datta
17TH ACM INTERNATIONAL CONFERENCE ON PERVASIVE TECHNOLOGIES RELATED TO ASSISTIVE ENVIRONMENTS, PETRA 2024, 2024, : 375 - 381
[27] Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework
Liu, Yang
Sun, Haoqin
Guan, Wenbo
Xia, Yuqi
Zhao, Zhen
SPEECH COMMUNICATION, 2022, 139 : 1 - 9
[28] SPEECH EMOTION RECOGNITION WITH GLOBAL-AWARE FUSION ON MULTI-SCALE FEATURE REPRESENTATION
Zhu, Wenjing
Li, Xiang
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6437 - 6441
[30] SpeechEQ: Speech Emotion Recognition based on Multi-scale Unified Datasets and Multitask Learning
Kang, Zuheng
Peng, Junqing
Wang, Jianzong
Xiao, Jing
INTERSPEECH 2022, 2022, : 4745 - 4749
