FINE-GRAINED EMOTION STRENGTH TRANSFER, CONTROL AND PREDICTION FOR EMOTIONAL SPEECH SYNTHESIS

Cited by: 32

Authors:

Lei, Yi [1]

Yang, Shan [1]

Xie, Lei [1]

Affiliations:

[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Keywords:

text-to-speech; expressive speech synthesis; emotion strength; sequence-to-sequence

DOI:

10.1109/SLT48900.2021.9383524

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper proposes a unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis. Conventional emotional speech synthesis often needs manual labels or reference audio to determine the emotional expressions of synthesized speech. Such coarse labels cannot control the details of speech emotion, often resulting in an averaged emotion expression delivery, and it is also hard to choose suitable reference audio during inference. To conduct fine-grained emotion expression generation, we introduce phoneme-level emotion strength representations through a learned ranking function to describe the local emotion details, and the sentence-level emotion category is adopted to render the global emotions of synthesized speech. With the global render and local descriptors of emotions, we can obtain fine-grained emotion expressions from reference audio via its emotion descriptors (for transfer) or directly from phoneme-level manual labels (for control). As for the emotional speech synthesis with arbitrary text inputs, the proposed model can also predict phoneme-level emotion expressions from texts, which does not require any reference audio or manual label.
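The abstract does not say how the ranking function is learned. As a hedged illustration only, the sketch below shows one common way to obtain a relative "strength" score: fit a linear ranking function with a pairwise hinge loss so that emotional feature vectors score higher than neutral ones, then min-max normalize the scores to [0, 1]. The function names (`learn_ranking_weights`, `emotion_strength`), the linear form, the hyperparameters, and the synthetic features are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def learn_ranking_weights(emotional, neutral, epochs=200, lr=0.1, reg=1e-3):
    """Fit w so that w @ x_emotional exceeds w @ x_neutral by a margin (pairwise hinge loss).

    Hypothetical helper: a plain subgradient-descent ranking learner, assumed here
    for illustration; the paper's actual ranking function may differ.
    """
    dim = emotional.shape[1]
    # Difference vectors for every (emotional, neutral) pair.
    diffs = (emotional[:, None, :] - neutral[None, :, :]).reshape(-1, dim)
    w = np.zeros(dim)
    for _ in range(epochs):
        margins = diffs @ w
        violated = diffs[margins < 1.0]  # pairs not yet ranked with margin 1
        grad = reg * w
        if len(violated):
            grad = grad - violated.sum(axis=0) / len(diffs)
        w -= lr * grad
    return w


def emotion_strength(features, w):
    """Score phoneme-level feature vectors and min-max normalize to [0, 1]."""
    scores = features @ w
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for acoustic features (assumption, not real data).
    emotional = rng.normal(loc=1.0, size=(50, 4))
    neutral = rng.normal(loc=-1.0, size=(50, 4))
    w = learn_ranking_weights(emotional, neutral)
    strengths = emotion_strength(np.vstack([emotional, neutral]), w)
    print("mean strength (emotional):", strengths[:50].mean())
    print("mean strength (neutral):", strengths[50:].mean())
```

In a setup like this, the per-phoneme scores would act as the local strength descriptors, while the sentence-level emotion category would be supplied separately as the global condition.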

Pages: 423-430

Page count: 8
