MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis

Cited by: 32

Authors:

Lei, Yi [1]

Yang, Shan [2]

Wang, Xinsheng [3,4]

Xie, Lei [1]

Affiliations:

[1] Northwestern Polytech Univ, Sch Comp Sci, ASGO, Audio Speech & Language Proc Grp, Xian 710072, Peoples R China

[2] Tencent AI Lab, Beijing 100086, Peoples R China

[3] Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China

[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2022 / Vol. 30

Keywords:

Speech synthesis; Predictive models; Analytical models; Virtual assistants; Speech; Feature extraction; Decoding; emotional speech synthesis; emotion strengths; multi-scale; PROSODY;

DOI:

10.1109/TASLP.2022.3145293

Chinese Library Classification number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods perform expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio; both can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows emotional speech to be synthesized in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based emotional speech synthesis methods on emotion transfer speech synthesis and text-based emotion prediction speech synthesis, respectively. The experiments also show that the proposed method can control emotion expression flexibly. Detailed analysis shows the effectiveness of each module and the sound design of the proposed method.

Citation

Pages: 853-864

Page count: 12
