EMOTION-CONTROLLABLE SPEECH SYNTHESIS USING EMOTION SOFT LABELS AND FINE-GRAINED PROSODY FACTORS

Cited by: 0
Authors
Luo, Xuan [1]
Takamichi, Shinnosuke [1]
Koriyama, Tomoki [1]
Saito, Yuki [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Source
2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021
Keywords
DOI: Not available
Chinese Library Classification: TP [Automation technology, computer technology]
Discipline classification code: 0812
Abstract
We propose an emotion-controllable text-to-speech (TTS) model that allows both emotion-level (i.e., coarse-grained) and prosody-factor-level (i.e., fine-grained) control of speech using both emotion soft labels and prosody factors. Conventional methods control speech only with emotion labels, emotion strength, or prosody factors (e.g., the mean and standard deviation of pitch), which cannot express diverse emotions. Our model is based on a speech emotion recognizer (SER), which encodes utterance-level prosody factors into emotion soft labels, and a prosody factor generator (PFG), which decodes the encoded emotion soft labels back into utterance-level prosody factors. This design enables both emotion labels and prosody factors to control the emotion of synthetic speech. Experimental results show that the emotion-perceptual accuracy of synthetic speech reached 66%, and the mean opinion score for the naturalness of emotionally controlled synthetic speech was 3.9, comparable to that of a conventional method that uses only prosody factors.
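The abstract describes an SER that encodes utterance-level prosody factors into emotion soft labels and a PFG that decodes soft labels back into prosody factors. Below is a minimal conceptual sketch of that encoder/decoder pairing, not the authors' implementation: the feed-forward layer sizes, the specific set of prosody factors, and the four-emotion inventory are all assumptions made for illustration.

```python
# Conceptual sketch (assumed architecture, not the paper's actual model):
# an SER mapping utterance-level prosody factors to emotion soft labels,
# and a PFG mapping soft labels back to prosody factors.
import torch
import torch.nn as nn

NUM_PROSODY_FACTORS = 6   # e.g. mean/std of pitch, energy, duration (assumed)
NUM_EMOTIONS = 4          # e.g. neutral, happy, sad, angry (assumed)

class SpeechEmotionRecognizer(nn.Module):
    """Encodes utterance-level prosody factors into an emotion soft label."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_PROSODY_FACTORS, 64), nn.ReLU(),
            nn.Linear(64, NUM_EMOTIONS),
        )

    def forward(self, prosody_factors: torch.Tensor) -> torch.Tensor:
        # Softmax turns logits into a soft (probabilistic) emotion label.
        return torch.softmax(self.net(prosody_factors), dim=-1)

class ProsodyFactorGenerator(nn.Module):
    """Decodes an emotion soft label back into utterance-level prosody factors."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_EMOTIONS, 64), nn.ReLU(),
            nn.Linear(64, NUM_PROSODY_FACTORS),
        )

    def forward(self, soft_labels: torch.Tensor) -> torch.Tensor:
        return self.net(soft_labels)

if __name__ == "__main__":
    ser, pfg = SpeechEmotionRecognizer(), ProsodyFactorGenerator()
    # Coarse-grained control: specify an emotion soft label directly
    # (e.g. 70% happy / 30% neutral) and generate prosody factors for the TTS model.
    soft_label = torch.tensor([[0.3, 0.7, 0.0, 0.0]])
    prosody = pfg(soft_label)
    # Fine-grained control: edit the prosody factors directly, then
    # re-estimate the perceived emotion with the SER.
    print(ser(prosody))
```

The sketch only illustrates the control directions implied by the abstract (soft label to prosody factors, and prosody factors to soft label); the paper's actual networks, training objectives, and feature sets are not specified in this record.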
Pages: 794-799
Page count: 6