ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS

Cited by: 2

Authors:

Tang, Haobin [1,2]

Zhang, Xulong [1]

Cheng, Ning [1]

Xiao, Jing [1]

Wang, Jianzong [1]

Affiliations:

[1] Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, People's Republic of China

[2] University of Science and Technology of China, Hefei, People's Republic of China

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024

Keywords:

emotional speech synthesis; speech emotion diarization; diffusion denoising probabilistic model

DOI:

10.1109/ICASSP48485.2024.10446467

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
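
The abstract does not say how the utterance-level and frame-level embeddings are combined before they condition the diffusion model. As a minimal sketch only, assuming simple concatenation (one common choice, not confirmed by the source), the utterance-level vector can be repeated on every frame and appended to that frame's fine-grained embedding. The function name `build_condition` and the toy values are hypothetical.

```python
from typing import List


def build_condition(utt_emb: List[float],
                    frame_embs: List[List[float]]) -> List[List[float]]:
    """Append one utterance-level vector to every frame-level vector.

    Assumption: concatenation is used for illustration; the paper's
    actual fusion mechanism is not described in the abstract.
    """
    return [list(frame) + list(utt_emb) for frame in frame_embs]


# Toy values: a 2-dim utterance-level embedding (as from an SER model)
# and two 3-dim frame-level embeddings (as from an SED model).
utt = [0.1, 0.9]
frames = [[0.2, 0.8, 0.0], [0.7, 0.1, 0.2]]

# One conditioning vector per frame, each of dimension 3 + 2 = 5.
cond = build_condition(utt, frames)
```

In this sketch, `cond` keeps the frame resolution of the fine-grained embedding while every frame also carries the global emotion vector; a denoising network could then take `cond` as its conditioning input at each reverse step.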

Pages: 12146-12150

Number of pages: 5
