ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS

Cited by: 2
Authors
Tang, Haobin [1 ,2 ]
Zhang, Xulong [1 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Wang, Jianzong [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024
Keywords
emotional speech synthesis; speech emotion diarization; denoising diffusion probabilistic model
DOI
10.1109/ICASSP48485.2024.10446467
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
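The conditioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two embeddings are integrated, so concatenation along the feature axis is an assumption, and `combine_emotion_embeddings`, `toy_noise_predictor`, the random `weight` matrix, and all dimensions are hypothetical stand-ins. Only the reverse-step formula is the standard DDPM ancestral sampling update.

```python
import numpy as np

def combine_emotion_embeddings(utt_emb, frame_emb):
    """Tile the utterance-level vector over time and concatenate it with
    the frame-level sequence (assumed integration scheme, not from the paper)."""
    num_frames = frame_emb.shape[0]
    utt_tiled = np.broadcast_to(utt_emb, (num_frames, utt_emb.shape[0]))
    return np.concatenate([utt_tiled, frame_emb], axis=1)

def toy_noise_predictor(x_t, t, cond, weight):
    """Hypothetical stand-in for a learned denoiser eps_theta(x_t, t, cond).
    The timestep t is accepted but ignored in this toy version."""
    return np.tanh(x_t + cond @ weight)

def ddpm_reverse_step(x_t, t, cond, betas, weight, rng):
    """One standard DDPM reverse step x_t -> x_{t-1}, conditioned on cond."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps = toy_noise_predictor(x_t, t, cond, weight)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Illustrative sizes only: 50 frames, 80 mel bins, 16-dim and 8-dim embeddings.
rng = np.random.default_rng(0)
num_frames, mel_dim, utt_dim, frame_dim = 50, 80, 16, 8
utt_emb = rng.standard_normal(utt_dim)                    # stands in for the SER output
frame_emb = rng.standard_normal((num_frames, frame_dim))  # stands in for the SED output
cond = combine_emotion_embeddings(utt_emb, frame_emb)
weight = 0.1 * rng.standard_normal((utt_dim + frame_dim, mel_dim))
betas = np.linspace(1e-4, 0.02, 100)                      # linear noise schedule

x = rng.standard_normal((num_frames, mel_dim))            # start from Gaussian noise
for t in reversed(range(len(betas))):
    x = ddpm_reverse_step(x, t, cond, betas, weight, rng)
```

Here every frame of the conditioning matrix carries the same utterance-level vector plus its own frame-level vector, which is one simple way to expose both scales to the denoiser at each reverse step.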
Pages: 12146-12150
Page count: 5