MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis

Cited by: 32
Authors
Lei, Yi [1 ]
Yang, Shan [2 ]
Wang, Xinsheng [3 ,4 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, ASGO, Audio Speech & Language Proc Grp, Xian 710072, Peoples R China
[2] Tencent AI Lab, Beijing 100086, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
Keywords
Speech synthesis; Predictive models; Analytical models; Virtual assistants; Speech; Feature extraction; Decoding; emotional speech synthesis; emotion strengths; multi-scale; PROSODY;
DOI
10.1109/TASLP.2022.3145293
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, so synthesizing expressive speech has attracted much attention in recent years. Previous methods perform expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio; both can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework that models emotion at different levels. Specifically, the proposed method is an attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, the utterance-level emotion variation, and the syllable-level emotion strength, respectively. Beyond modeling emotion at different levels, the proposed method also allows synthesizing emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based emotional speech synthesis methods on emotion transfer and on text-based emotion prediction, respectively. The experiments also show that the proposed method can control emotion expression flexibly, and detailed analysis confirms the effectiveness of each module.
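The abstract describes conditioning a sequence-to-sequence TTS model on three scales of emotion information at once. The sketch below illustrates one plausible way such multi-scale conditioning could combine: a global emotion-category embedding and an utterance-level embedding broadcast over all encoder time steps, while syllable-level strengths are upsampled to the token level. The names GM/UM/LM follow the abstract, but the shapes, the additive fusion, and the upsampling scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8  # embedding dimension (assumed)
T = 6  # encoder time steps, e.g. phoneme tokens (assumed)

text_enc = rng.normal(size=(T, D))  # text-encoder outputs
gm = rng.normal(size=(D,))          # global emotion-category embedding (GM)
um = rng.normal(size=(D,))          # utterance-level emotion-variation embedding (UM)

# LM: one scalar strength per syllable (manually set or predicted),
# upsampled to the token level by repeating over each syllable's tokens.
syl_strengths = np.array([0.2, 0.9, 0.5])
syl_lens = [2, 1, 3]                              # tokens per syllable; sums to T
lm = np.repeat(syl_strengths, syl_lens)[:, None]  # shape (T, 1), broadcasts over D

# Additive conditioning: coarse scales are constant over time,
# the fine (local) scale varies token by token.
conditioned = text_enc + gm[None, :] + um[None, :] + lm

print(conditioned.shape)  # (6, 8)
```

The key point is only that the three scales operate at different temporal resolutions and are merged into a single per-token conditioning signal before decoding; the actual model learns these embeddings jointly with the synthesizer.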
Pages: 853-864 (12 pages)
Related Papers
(50 records in total)
  • [21] Multi-Scale Speaker Vectors for Zero-Shot Speech Synthesis
    Cory, Tristin
    Iqbal, Razib
    2022 IEEE 46TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2022), 2022, : 496 - 501
  • [22] MBDA: A Multi-scale Bidirectional Perception Approach for Cross-Corpus Speech Emotion Recognition
    Li, Jiayang
    Wang, Xiaoye
    Li, Siyuan
    Shi, Jia
    Xiao, Yingyuan
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT III, ICIC 2024, 2024, 14877 : 329 - 341
  • [23] TLBT-Net: A Multi-scale Cross-fusion Model for Speech Emotion Recognition
    Yu, Anli
    Sun, Xuelian
    Wu, Xiaoyang
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON MODELING, NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING, CMNM 2024, 2024, : 245 - 250
  • [24] GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition*
    Ye, Jia-Xin
    Wen, Xin-Cheng
    Wang, Xuan-Ze
    Xu, Yong
    Luo, Yan
    Wu, Chang-Li
    Chen, Li-Yan
    Liu, Kun-Hong
    SPEECH COMMUNICATION, 2022, 145 : 21 - 35
  • [25] Improving Fine-Grained Emotion Control and Transfer with Gated Emotion Representations in Speech Synthesis
    Ye, Jianhao
    He, Tianwei
    Zhou, Hongbin
    Ren, Kaimeng
    He, Wendi
    Lu, Heng
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022, 2023, 1765 : 196 - 207
  • [26] Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework
    Liu, Yang
    Sun, Haoqin
    Guan, Wenbo
    Xia, Yuqi
    Zhao, Zhen
    SPEECH COMMUNICATION, 2022, 139 : 1 - 9
  • [28] Embroidery style transfer modeling based on multi-scale texture synthesis
    Yao L.
    Zhang Y.
    Yao L.
    Zheng X.
    Wei W.
    Liu C.
    Fangzhi Xuebao/Journal of Textile Research, 2023, 44 (09) : 84 - 90
  • [29] EEG Emotion Recognition by Fusion of Multi-Scale Features
    Du, Xiuli
    Meng, Yifei
    Qiu, Shaoming
    Lv, Yana
    Liu, Qingli
    BRAIN SCIENCES, 2023, 13 (09)
  • [30] Multi-scale approach for the prediction of atomic scale properties
    Grisafi, Andrea
    Nigam, Jigyasa
    Ceriotti, Michele
    CHEMICAL SCIENCE, 2021, 12 (06) : 2078 - 2090