MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis

Cited by: 32
Authors
Lei, Yi [1 ]
Yang, Shan [2 ]
Wang, Xinsheng [3 ,4 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, ASGO, Audio Speech & Language Proc Grp, Xian 710072, Peoples R China
[2] Tencent AI Lab, Beijing 100086, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
Keywords
Speech synthesis; Predictive models; Analytical models; Virtual assistants; Speech; Feature extraction; Decoding; emotional speech synthesis; emotion strengths; multi-scale; prosody
DOI
10.1109/TASLP.2022.3145293
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403
Abstract
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows emotional speech to be synthesized in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference audio-based and text-based emotional speech synthesis methods on emotion transfer speech synthesis and text-based emotion prediction speech synthesis, respectively. The experiments also show that the proposed method can control emotion expression flexibly. Detailed analysis shows the effectiveness of each module and of the overall design of the proposed method.
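The three conditioning scales named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the lookup table, the additive combination, the function name `condition_encoder_states`, and the `strength_direction` vector are all assumptions made for illustration. It only shows how a global category embedding, an utterance-level vector, and a per-syllable strength scalar might each contribute to syllable-level encoder states.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical lookup table: one embedding per global emotion category.
GLOBAL_EMOTION_TABLE: Dict[str, Vector] = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [1.0, 0.0, 0.0],
    "sad": [0.0, 1.0, 0.0],
}


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def scale(a: Vector, k: float) -> Vector:
    """Multiply every element of a vector by the scalar k."""
    return [k * x for x in a]


def condition_encoder_states(
    encoder_states: List[Vector],
    emotion: str,
    utterance_embedding: Vector,
    syllable_strengths: List[float],
    strength_direction: Vector,
) -> List[Vector]:
    """Add global, utterance-level, and local terms to each syllable state.

    Assumption: the three scales are combined by simple addition, and the
    local term is a fixed direction scaled by the per-syllable strength.
    """
    if len(encoder_states) != len(syllable_strengths):
        raise ValueError("need exactly one strength value per syllable")
    # Global and utterance-level terms are shared by every syllable.
    shared = add(GLOBAL_EMOTION_TABLE[emotion], utterance_embedding)
    conditioned = []
    for state, strength in zip(encoder_states, syllable_strengths):
        local = scale(strength_direction, strength)
        conditioned.append(add(add(state, shared), local))
    return conditioned


# Two syllables; the second is given full emotion strength.
states = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
result = condition_encoder_states(
    states, "happy", [0.0, 0.0, 0.5], [0.0, 1.0], [0.0, 0.0, 1.0]
)
print(result)  # [[1.0, 0.0, 0.5], [2.0, 1.0, 2.5]]
```

In this sketch, manual strength control corresponds to editing the values in `syllable_strengths`, while transfer and prediction would differ only in where `utterance_embedding` and the strengths come from (reference audio or input text).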
Pages: 853 - 864
Number of pages: 12
Related papers
50 in total
  • [31] MSStyleTTS: Multi-Scale Style Modeling With Hierarchical Context Information for Expressive Speech Synthesis
    Lei S.
    Zhou Y.
    Chen L.
    Wu Z.
    Wu X.
    Kang S.
    Meng H.
    IEEE/ACM Transactions on Audio Speech and Language Processing, 2023, 31 : 3290 - 3303
  • [32] CM-TCN: Channel-Aware Multi-scale Temporal Convolutional Networks for Speech Emotion Recognition
    Wu, Tianqi
    Wang, Liejun
    Zhang, Jiang
    NEURAL INFORMATION PROCESSING, ICONIP 2023, PT III, 2024, 14449 : 459 - 476
  • [33] Speech Emotion Recognition Using Multi-Scale Global-Local Representation Learning with Feature Pyramid Network
    Wang, Yuhua
    Huang, Jianxing
    Zhao, Zhengdao
    Lan, Haiyan
    Zhang, Xinjia
    APPLIED SCIENCES-BASEL, 2024, 14 (24):
  • [34] Multi-scale Generative Adversarial Networks for Speech Enhancement
    Li, Yihang
    Jiang, Ting
    Qin, Shan
    2019 7TH IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (IEEE GLOBALSIP), 2019,
  • [35] MULTI-SCALE OCTAVE CONVOLUTIONS FOR ROBUST SPEECH RECOGNITION
    Rownicka, Joanna
    Bell, Peter
    Renals, Steve
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7019 - 7023
  • [36] MSIPA: Multi-Scale Interval Pattern-Aware Network for ICU Transfer Prediction
    Lee, Wu
    Shi, Yuliang
    Sun, Hongfeng
    Cheng, Lin
    Zhang, Kun
    Wang, Xinjun
    Chen, Zhiyong
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2022, 16 (01)
  • [37] Multi-scale invariant fields: estimation and prediction
    Ghasemi, H.
    Rezakhah, S.
    Modarresi, N.
    JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, 2020, 2020 (07):
  • [38] MULTI-SCALE PREDICTION NETWORK FOR LUNG SEGMENTATION
    Gu, Yuchong
    Lai, Yaoming
    Xie, Peiliang
    Wei, Jun
    Lu, Yao
    2019 IEEE 16TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2019), 2019, : 438 - 442
  • [39] A Multi-Scale Approach for Graph Link Prediction
    Cai, Lei
    Ji, Shuiwang
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 3308 - 3315
  • [40] Control of multi-scale dynamics system
    Nandong, Jobrun
    Samyudia, Yudi
    Tade, Moses O.
    PROCEEDINGS OF THE 2007 IEEE CONFERENCE ON CONTROL APPLICATIONS, VOLS 1-3, 2007, : 328 - 333