EMOTION-CONTROLLABLE SPEECH SYNTHESIS USING EMOTION SOFT LABELS AND FINE-GRAINED PROSODY FACTORS

Cited by: 0
Authors
Luo, Xuan [1]
Takamichi, Shinnosuke [1]
Koriyama, Tomoki [1]
Saito, Yuki [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Source
2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021
Keywords
DOI: Not available
Chinese Library Classification: TP [Automation technology, computer technology]
Discipline classification code: 0812
Abstract
We propose an emotion-controllable text-to-speech (TTS) model that allows both emotion-level (i.e., coarse-grained) and prosody-factor-level (i.e., fine-grained) control of speech using both emotion soft labels and prosody factors. Conventional methods control speech only with emotion labels, emotion strength, or prosody factors (e.g., the mean and standard deviation of pitch), which cannot express diverse emotions. Our model is based on a speech emotion recognizer (SER), which encodes utterance-level prosody factors into emotion soft labels, and a prosody factor generator (PFG), which decodes the encoded emotion soft labels back into utterance-level prosody factors. This design enables both emotion labels and prosody factors to control the emotion of synthetic speech. Experimental results show that the emotion-perceptual accuracy of synthetic speech reached 66%, and the mean opinion score for the naturalness of emotionally controlled synthetic speech was 3.9, comparable to that of a conventional method that uses only prosody factors.
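The abstract describes an SER that encodes utterance-level prosody factors into emotion soft labels and a PFG that decodes soft labels back into prosody factors. Below is a minimal conceptual sketch of that encoder/decoder pairing, not the authors' implementation: the feed-forward layer sizes, the specific set of prosody factors, and the four-emotion inventory are all assumptions made for illustration.

```python
# Conceptual sketch (assumed architecture, not the paper's actual model):
# an SER mapping utterance-level prosody factors to emotion soft labels,
# and a PFG mapping soft labels back to prosody factors.
import torch
import torch.nn as nn

NUM_PROSODY_FACTORS = 6   # e.g. mean/std of pitch, energy, duration (assumed)
NUM_EMOTIONS = 4          # e.g. neutral, happy, sad, angry (assumed)

class SpeechEmotionRecognizer(nn.Module):
    """Encodes utterance-level prosody factors into an emotion soft label."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_PROSODY_FACTORS, 64), nn.ReLU(),
            nn.Linear(64, NUM_EMOTIONS),
        )

    def forward(self, prosody_factors: torch.Tensor) -> torch.Tensor:
        # Softmax turns logits into a soft (probabilistic) emotion label.
        return torch.softmax(self.net(prosody_factors), dim=-1)

class ProsodyFactorGenerator(nn.Module):
    """Decodes an emotion soft label back into utterance-level prosody factors."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_EMOTIONS, 64), nn.ReLU(),
            nn.Linear(64, NUM_PROSODY_FACTORS),
        )

    def forward(self, soft_labels: torch.Tensor) -> torch.Tensor:
        return self.net(soft_labels)

if __name__ == "__main__":
    ser, pfg = SpeechEmotionRecognizer(), ProsodyFactorGenerator()
    # Coarse-grained control: specify an emotion soft label directly
    # (e.g. 70% happy / 30% neutral) and generate prosody factors for the TTS model.
    soft_label = torch.tensor([[0.3, 0.7, 0.0, 0.0]])
    prosody = pfg(soft_label)
    # Fine-grained control: edit the prosody factors directly, then
    # re-estimate the perceived emotion with the SER.
    print(ser(prosody))
```

The sketch only illustrates the control directions implied by the abstract (soft label to prosody factors, and prosody factors to soft label); the paper's actual networks, training objectives, and feature sets are not specified in this record.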
Pages: 794-799
Page count: 6