MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis

Cited by: 32
Authors
Lei, Yi [1 ]
Yang, Shan [2 ]
Wang, Xinsheng [3 ,4 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, ASGO, Audio Speech & Language Proc Grp, Xian 710072, Peoples R China
[2] Tencent AI Lab, Beijing 100086, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
Keywords
Speech synthesis; Predictive models; Analytical models; Virtual assistants; Speech; Feature extraction; Decoding; emotional speech synthesis; emotion strengths; multi-scale; PROSODY;
DOI
10.1109/TASLP.2022.3145293
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, so synthesizing expressive speech has attracted much attention in recent years. Previous methods perform expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio; both can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework that models emotion at different levels. Specifically, the proposed method is an attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, the utterance-level emotion variation, and the syllable-level emotion strength, respectively. Beyond modeling emotion at different levels, the proposed method also allows synthesizing emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based emotional speech synthesis methods on emotion transfer and on text-based emotion prediction, respectively. The experiments also show that the proposed method can control emotion expression flexibly, and detailed analysis confirms the effectiveness of each module.
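The abstract describes conditioning a sequence-to-sequence TTS model on three scales of emotion information at once. The sketch below illustrates one plausible way such multi-scale conditioning could combine: a global emotion-category embedding and an utterance-level embedding broadcast over all encoder time steps, while syllable-level strengths are upsampled to the token level. The names GM/UM/LM follow the abstract, but the shapes, the additive fusion, and the upsampling scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8  # embedding dimension (assumed)
T = 6  # encoder time steps, e.g. phoneme tokens (assumed)

text_enc = rng.normal(size=(T, D))  # text-encoder outputs
gm = rng.normal(size=(D,))          # global emotion-category embedding (GM)
um = rng.normal(size=(D,))          # utterance-level emotion-variation embedding (UM)

# LM: one scalar strength per syllable (manually set or predicted),
# upsampled to the token level by repeating over each syllable's tokens.
syl_strengths = np.array([0.2, 0.9, 0.5])
syl_lens = [2, 1, 3]                              # tokens per syllable; sums to T
lm = np.repeat(syl_strengths, syl_lens)[:, None]  # shape (T, 1), broadcasts over D

# Additive conditioning: coarse scales are constant over time,
# the fine (local) scale varies token by token.
conditioned = text_enc + gm[None, :] + um[None, :] + lm

print(conditioned.shape)  # (6, 8)
```

The key point is only that the three scales operate at different temporal resolutions and are merged into a single per-token conditioning signal before decoding; the actual model learns these embeddings jointly with the synthesizer.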
Pages: 853-864 (12 pages)
Related Papers
(50 records in total)
  • [21] Multi-Scale Speaker Vectors for Zero-Shot Speech Synthesis
    Cory, Tristin
    Iqbal, Razib
    2022 IEEE 46TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2022), 2022, : 496 - 501
  • [22] MBDA: A Multi-scale Bidirectional Perception Approach for Cross-Corpus Speech Emotion Recognition
    Li, Jiayang
    Wang, Xiaoye
    Li, Siyuan
    Shi, Jia
    Xiao, Yingyuan
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT III, ICIC 2024, 2024, 14877 : 329 - 341
  • [23] TLBT-Net: A Multi-scale Cross-fusion Model for Speech Emotion Recognition
    Yu, Anli
    Sun, Xuelian
    Wu, Xiaoyang
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON MODELING, NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING, CMNM 2024, 2024, : 245 - 250
  • [24] GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition*
    Ye, Jia-Xin
    Wen, Xin-Cheng
    Wang, Xuan-Ze
    Xu, Yong
    Luo, Yan
    Wu, Chang-Li
    Chen, Li-Yan
    Liu, Kun-Hong
    SPEECH COMMUNICATION, 2022, 145 : 21 - 35
  • [25] Improving Fine-Grained Emotion Control and Transfer with Gated Emotion Representations in Speech Synthesis
    Ye, Jianhao
    He, Tianwei
    Zhou, Hongbin
    Ren, Kaimeng
    He, Wendi
    Lu, Heng
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022, 2023, 1765 : 196 - 207
  • [26] Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework
    Liu, Yang
    Sun, Haoqin
    Guan, Wenbo
    Xia, Yuqi
    Zhao, Zhen
    SPEECH COMMUNICATION, 2022, 139 : 1 - 9
  • [28] Embroidery style transfer modeling based on multi-scale texture synthesis
    Yao L.
    Zhang Y.
    Yao L.
    Zheng X.
    Wei W.
    Liu C.
    Fangzhi Xuebao/Journal of Textile Research, 2023, 44 (09) : 84 - 90
  • [29] EEG Emotion Recognition by Fusion of Multi-Scale Features
    Du, Xiuli
    Meng, Yifei
    Qiu, Shaoming
    Lv, Yana
    Liu, Qingli
    BRAIN SCIENCES, 2023, 13 (09)
  • [30] Multi-scale approach for the prediction of atomic scale properties
    Grisafi, Andrea
    Nigam, Jigyasa
    Ceriotti, Michele
    CHEMICAL SCIENCE, 2021, 12 (06) : 2078 - 2090