ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS

Cited by: 2
Authors
Tang, Haobin [1, 2]
Zhang, Xulong [1]
Cheng, Ning [1]
Xiao, Jing [1]
Wang, Jianzong [1]
Affiliations
[1] Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, People's Republic of China
[2] University of Science and Technology of China, Hefei, People's Republic of China
Keywords
emotional speech synthesis; speech emotion diarization; diffusion denoising probabilistic model
DOI
10.1109/ICASSP48485.2024.10446467
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with the fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of a denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
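The multi-scale conditioning described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract does not say how the two embeddings are combined, so concatenation is an assumption here, and the function names (fuse_emotion_embeddings, soft_labels, reverse_step, toy_eps_model), the array shapes, and the noise schedule are all hypothetical. The reverse step follows the standard DDPM update with variance beta_t, and the noise predictor is a placeholder standing in for a trained network.

```python
import numpy as np


def fuse_emotion_embeddings(utt_emb, frame_emb):
    """Tile one utterance-level vector over all frames and concatenate it
    with the frame-level vectors (concatenation is an assumption)."""
    num_frames = frame_emb.shape[0]
    tiled = np.broadcast_to(utt_emb, (num_frames, utt_emb.shape[0]))
    return np.concatenate([tiled, frame_emb], axis=1)


def soft_labels(logits):
    """Turn per-frame emotion logits into soft labels with a stable softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def toy_eps_model(x_t, t, cond):
    """Placeholder noise predictor; a real system would use a trained network."""
    return 0.1 * x_t + 0.01 * cond.mean(axis=1, keepdims=True)


def reverse_step(x_t, t, cond, eps_model, betas, rng):
    """One standard DDPM reverse step, conditioned on `cond`."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps = eps_model(x_t, t, cond)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_frames, n_mels = 6, 8                    # hypothetical sizes
    utt_emb = rng.standard_normal(4)             # stand-in for an SER embedding
    frame_emb = rng.standard_normal((num_frames, 3))  # stand-in for SED output
    cond = fuse_emotion_embeddings(utt_emb, frame_emb)

    betas = np.linspace(1e-4, 0.02, 50)          # hypothetical noise schedule
    x = rng.standard_normal((num_frames, n_mels))
    for t in reversed(range(len(betas))):
        x = reverse_step(x, t, cond, toy_eps_model, betas, rng)
    print(cond.shape, x.shape)
```

The sketch only shows where each level of emotion information enters: the utterance-level vector is constant across frames, while the frame-level vectors vary over time, and both reach the noise predictor at every reverse step.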
Pages: 12146-12150
Page count: 5