EMOTION-CONTROLLABLE SPEECH SYNTHESIS USING EMOTION SOFT LABELS AND FINE-GRAINED PROSODY FACTORS

Authors
Luo, Xuan [1]
Takamichi, Shinnosuke [1]
Koriyama, Tomoki [1]
Saito, Yuki [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
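Source
2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) | 2021
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract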
We propose an emotion-controllable text-to-speech (TTS) model that allows both emotion-level (i.e., coarse-grained) and prosody-factor-level (i.e., fine-grained) control of speech using both emotion soft labels and prosody factors. Conventional methods control speech only by using emotion labels, emotion strength, or prosody factors (e.g., mean and standard deviation of pitch), which cannot express diverse emotions. Our model is based on a speech emotion recognizer (SER) and a prosody factor generator (PFG) model that encodes utterance-level prosody factors into emotion soft labels and decodes encoded emotion soft labels back into utterance-level prosody factors. Our model enables emotion labels and prosody factors to control synthetic speech emotion. Experimental results show that the emotion-perceptual accuracy of synthetic speech reached 66%, and the mean opinion score for the naturalness of emotionally controlled synthetic speech was 3.9, which is comparable to a conventional method that only uses prosody factors.
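The two control signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an emotion soft label is a probability distribution obtained by applying a softmax to recognizer scores, and that the utterance-level prosody factors are the mean and standard deviation of pitch over voiced frames. The function names, the four-emotion label set, and the convention that unvoiced frames have an F0 of 0 are all hypothetical.

```python
import math
from statistics import mean, pstdev

# Hypothetical emotion inventory; the record does not list the paper's label set.
EMOTIONS = ("neutral", "happy", "sad", "angry")


def soft_label(logits):
    """Turn recognizer scores into a soft label (assumed here to be a softmax)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prosody_factors(f0_hz):
    """Utterance-level pitch mean and standard deviation over voiced frames.

    Assumes unvoiced frames are marked with F0 == 0 (a common convention,
    not something stated in the abstract).
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return {"f0_mean": 0.0, "f0_std": 0.0}
    return {"f0_mean": mean(voiced), "f0_std": pstdev(voiced)}


if __name__ == "__main__":
    label = soft_label([2.0, 0.5, 0.1, 1.0])
    print(dict(zip(EMOTIONS, label)))
    print(prosody_factors([0.0, 180.0, 200.0, 220.0, 0.0]))
```

In this reading, the soft label provides the coarse, emotion-level control and the pitch statistics provide the fine-grained control. The encoder and decoder that map between them in the proposed model are learned networks and are not reproduced here.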
Pages: 794 - 799
Number of pages: 6
Related papers
50 in total
  • [11] CANCEREMO: A Dataset for Fine-Grained Emotion Detection
    Sosea, Tiberiu
    Caragea, Cornelia
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 8892 - 8904
  • [12] FULLY-HIERARCHICAL FINE-GRAINED PROSODY MODELING FOR INTERPRETABLE SPEECH SYNTHESIS
    Sun, Guangzhi
    Zhang, Yu
    Weiss, Ron J.
    Cao, Yuan
    Zen, Heiga
    Wu, Yonghui
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6264 - 6268
  • [13] MULTI-SPEAKER EMOTIONAL SPEECH SYNTHESIS WITH FINE-GRAINED PROSODY MODELING
    Lu, Chunhui
    Wen, Xue
    Liu, Ruolan
    Chen, Xiao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5729 - 5733
  • [14] ROBUST AND FINE-GRAINED PROSODY CONTROL OF END-TO-END SPEECH SYNTHESIS
    Lee, Younggun
    Kim, Taesu
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5911 - 5915
  • [15] Fine-grained facial expression analysis using dimensional emotion model
    Zhou, Feng
    Kong, Shu
    Fowlkes, Charless
    Chen, Tao
    Lei, Baiying
    NEUROCOMPUTING, 2020, 392 : 38 - 49
  • [16] Fine-grained facial expression analysis using dimensional emotion model
    Zhou F.
    Kong S.
    Fowlkes C.C.
    Chen T.
    Lei B.
    Neurocomputing, 2020, 392 : 38 - 49
  • [17] EMOTION CONTROLLABLE SPEECH SYNTHESIS USING EMOTION-UNLABELED DATASET WITH THE ASSISTANCE OF CROSS-DOMAIN SPEECH EMOTION RECOGNITION
    Cai, Xiong
    Dai, Dongyang
    Wu, Zhiyong
    Li, Xiang
    Li, Jingbei
    Meng, Helen
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5734 - 5738
  • [18] A Fine-Grained Emotion Analysis Method for Chinese Microblog
    Zhou, Rui
    Zhang, Hu-yin
    Ye, Gang
    DATA SCIENCE, PT 1, 2017, 727 : 1 - 11
  • [19] Text-Based Fine-Grained Emotion Prediction
    Singh, Gargi
    Brahma, Dhanajit
    Rai, Piyush
    Modi, Ashutosh
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (02) : 405 - 416
  • [20] MPAF-CNN: Multiperspective aware and fine-grained fusion strategy for speech emotion recognition
    Li, Guoyan
    Hou, Junjie
    Liu, Yi
    Wei, Jianguo
    APPLIED ACOUSTICS, 2023, 214