ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS

Times Cited: 2
Authors
Tang, Haobin [1 ,2 ]
Zhang, Xulong [1 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Wang, Jianzong [1 ]
Affiliations
[1] Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, People's Republic of China
[2] University of Science and Technology of China, Hefei, People's Republic of China
Keywords
emotional speech synthesis; speech emotion diarization; denoising diffusion probabilistic model
DOI
10.1109/ICASSP48485.2024.10446467
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with the fine-grained, frame-level emotion embeddings obtained from SED. These embeddings condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to predict accurate frame-level soft labels, addressing the scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
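The conditioning scheme described in the abstract can be illustrated with a small sketch. The code below is a minimal, hypothetical PyTorch denoiser, not the authors' implementation: all module names, dimensions, and the simple additive fusion of the utterance-level SER embedding with the frame-level SED soft labels are assumptions made purely for illustration.

    # Minimal sketch (assumed architecture, not the paper's code): a DDPM noise
    # predictor for mel-spectrograms conditioned on an utterance-level SER
    # embedding and frame-level SED soft labels.
    import torch
    import torch.nn as nn

    class EmotionConditionedDenoiser(nn.Module):
        def __init__(self, n_mels=80, d_model=256, n_emotions=5, ser_dim=256):
            super().__init__()
            self.in_proj = nn.Linear(n_mels, d_model)
            # Diffusion timestep embedding (a sinusoidal embedding could be used instead).
            self.t_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                         nn.Linear(d_model, d_model))
            # Utterance-level emotion embedding from a SER model (dimension assumed).
            self.utt_proj = nn.Linear(ser_dim, d_model)
            # Frame-level soft labels from the cross-domain SED model (one probability per class).
            self.frame_proj = nn.Linear(n_emotions, d_model)
            self.backbone = nn.GRU(d_model, d_model, num_layers=2, batch_first=True)
            self.out_proj = nn.Linear(d_model, n_mels)

        def forward(self, x_t, t, utt_emb, frame_soft_labels):
            # x_t: (B, T, n_mels) noisy mel at diffusion step t
            # t: (B,) diffusion timesteps
            # utt_emb: (B, ser_dim) utterance-level SER embedding
            # frame_soft_labels: (B, T, n_emotions) frame-level SED soft labels
            h = self.in_proj(x_t)
            h = h + self.t_embed(t.float().unsqueeze(-1)).unsqueeze(1)  # broadcast over frames
            h = h + self.utt_proj(utt_emb).unsqueeze(1)                 # global (utterance-level) emotion
            h = h + self.frame_proj(frame_soft_labels)                  # local (frame-level) emotion
            h, _ = self.backbone(h)
            return self.out_proj(h)                                     # predicted noise

    # Toy usage: 8 utterances of 120 frames with 5 emotion classes.
    model = EmotionConditionedDenoiser()
    x_t = torch.randn(8, 120, 80)
    t = torch.randint(0, 1000, (8,))
    utt_emb = torch.randn(8, 256)
    frame_labels = torch.softmax(torch.randn(8, 120, 5), dim=-1)
    eps_hat = model(x_t, t, utt_emb, frame_labels)  # (8, 120, 80)

Under these assumptions, the utterance-level embedding supplies a global emotion bias while the frame-level soft labels vary over time, which is one straightforward way to realize the multi-scale conditioning the abstract describes.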
Pages: 12146-12150
Number of pages: 5