ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS

Cited by: 2
Authors
Tang, Haobin [1 ,2 ]
Zhang, Xulong [1 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Wang, Jianzong [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
Keywords
emotional speech synthesis; speech emotion diarization; diffusion denoising probabilistic model
DOI
10.1109/ICASSP48485.2024.10446467
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
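The conditioning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the two embeddings are combined, so the additive fusion below is an assumption, and `toy_noise_predictor` is a hypothetical stand-in for a trained noise-prediction network. The reverse step follows the standard DDPM sampling update with a linear beta schedule. For simplicity the generated features are given the same dimension as the embeddings.

```python
import math
import random


def fuse_conditions(utt_emb, frame_embs):
    """Add one utterance-level vector to every frame-level vector.

    Additive fusion is an assumption made for this sketch only.
    """
    return [[f + u for f, u in zip(frame, utt_emb)] for frame in frame_embs]


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_0 .. beta_{num_steps-1}."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def toy_noise_predictor(x_t, t, cond):
    """Hypothetical placeholder for a trained network eps_theta(x_t, t, cond)."""
    return [
        [0.1 * x + 0.01 * c for x, c in zip(x_row, c_row)]
        for x_row, c_row in zip(x_t, cond)
    ]


def reverse_step(x_t, t, cond, betas, alpha_bars, rng):
    """One conditioned DDPM reverse step from x_t to x_{t-1}."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps = toy_noise_predictor(x_t, t, cond)
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t])
    # No noise is added on the final step (t == 0).
    sigma = math.sqrt(beta_t) if t > 0 else 0.0
    return [
        [(x - coef * e) / math.sqrt(alpha_t) + sigma * rng.gauss(0.0, 1.0)
         for x, e in zip(x_row, e_row)]
        for x_row, e_row in zip(x_t, eps)
    ]


def sample(cond, num_steps=50, seed=0):
    """Run the reverse process from Gaussian noise, conditioned on `cond`."""
    rng = random.Random(seed)
    betas = linear_beta_schedule(num_steps)
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b  # cumulative product of alpha_t = 1 - beta_t
        alpha_bars.append(prod)
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in cond]
    for t in reversed(range(num_steps)):
        x = reverse_step(x, t, cond, betas, alpha_bars, rng)
    return x


# Usage: 3 frames, 2-dimensional embeddings (illustrative values only).
utt = [0.5, -0.5]                               # utterance-level emotion embedding
frames = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]   # frame-level emotion embeddings
cond = fuse_conditions(utt, frames)
out = sample(cond, num_steps=10)
```

In a real system the frame-level embeddings would first be aligned to the acoustic frame rate, and the noise predictor would be a trained neural network; both are outside the scope of this sketch.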
Pages: 12146-12150
Page count: 5