Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability

Cited by: 14
Authors
Liu, Rui [1, 2]
Sisman, Berrak [1]
Li, Haizhou [2, 3]
Affiliations
[1] Singapore Univ Technol & Design SUTD, Singapore, Singapore
[2] Natl Univ Singapore NUS, ECE Dept, Singapore, Singapore
[3] Univ Bremen, Machine Listening Lab, Bremen, Germany
Source
Keywords
Reinforcement Learning; Emotional Text-to-Speech Synthesis; Speech Emotion Recognition;
DOI
10.21437/Interspeech.2021-1236
CLC number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104; 100213;
Abstract
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable as its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to directly improve emotion discriminability by interacting with a speech emotion recognition (SER) model. Moreover, we formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization. Experimental results demonstrate that the proposed i-ETTS outperforms the state-of-the-art baselines by rendering speech with a more accurate emotion style. To the best of our knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
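The abstract gives no algorithmic detail, so the following is only a toy sketch of the general idea it describes: a generator is rewarded when a recognizer identifies the intended emotion, and the reward drives a policy-gradient (REINFORCE) update. Everything here is an assumption made for illustration, not the authors' i-ETTS implementation: the label set `EMOTIONS`, the stand-in recognizer `toy_ser`, the choice of REINFORCE, and the reduction of "synthesis" to picking one of four discrete styles.

```python
import math
import random

# Hypothetical emotion label set; the record does not list the paper's categories.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_ser(rendered_style):
    """Stand-in for a speech emotion recognition model (assumption).

    A real SER model would classify audio; this toy version simply
    reports the index of the style that was rendered.
    """
    return rendered_style


def train(target, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a softmax policy over discrete emotion styles."""
    rng = random.Random(seed)
    logits = [0.0] * len(EMOTIONS)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a style to "render" from the current policy.
        action = rng.choices(range(len(EMOTIONS)), weights=probs)[0]
        # Reward is 1 when the recognizer hears the intended emotion.
        reward = 1.0 if toy_ser(action) == target else 0.0
        # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)


probs = train(target=EMOTIONS.index("sad"))
```

After training, `probs` concentrates on the target emotion, which mirrors the stated goal of improving emotion discriminability. A real system would replace the discrete choice with a speech synthesizer and `toy_ser` with a trained recognizer.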
Pages: 4648 - 4652
Number of pages: 5
Related papers
50 in total
  • [1] Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech
    Wang, Shijun
    Gudnason, Jon
    Borth, Damian
    [J]. INTERSPEECH 2023, 2023, : 351 - 355
  • [2] EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model
    Cui, Chenye
    Ren, Yi
    Liu, Jinglin
    Chen, Feiyang
    Huang, Rongjie
    Lei, Ming
    Zhao, Zhou
    [J]. INTERSPEECH 2021, 2021, : 2766 - 2770
  • [3] Modeling and synthesizing emotional speech for Catalan text-to-speech synthesis
    Iriondo, I
    Alías, F
    Melenchón, J
    Llorca, MA
    [J]. AFFECTIVE DIALOGUE SYSTEMS, PROCEEDINGS, 2004, 3068 : 197 - 208
  • [4] Statistical Text-to-Speech Synthesis with Improved Dynamics
    Tiomkin, Stas
    Malah, David
    [J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 1841 - 1844
  • [5] IMPROVED POS TAGGING FOR TEXT-TO-SPEECH SYNTHESIS
    Sun, Ming
    Bellegarda, Jerome R.
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011, : 5384 - 5387
  • [6] Text aware Emotional Text-to-speech with BERT
    Mukherjee, Arijit
    Bansal, Shubham
    Satpal, Sandeepkumar
    Mehta, Rupesh
    [J]. INTERSPEECH 2022, 2022, : 4601 - 4605
  • [7] TEXT-TO-SPEECH SYNTHESIS
    SPROAT, RW
    OLIVE, JP
    [J]. AT&T TECHNICAL JOURNAL, 1995, 74 (02): : 35 - 44
  • [8] EMOTIONAL VOICE CONVERSION USING MULTITASK LEARNING WITH TEXT-TO-SPEECH
    Kim, Tae-Ho
    Cho, Sungjae
    Choi, Shinkook
    Park, Sejik
    Lee, Soo-Young
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7774 - 7778
  • [9] TOWARDS LIFELONG LEARNING OF MULTILINGUAL TEXT-TO-SPEECH SYNTHESIS
    Yang, Mu
    Ding, Shaojin
    Chen, Tianlong
    Wang, Tong
    Wang, Zhangyang
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8022 - 8026
  • [10] Text and Speech Corpora for Text-To-Speech Synthesis of Tales
    Doukhan, David
    Rosset, Sophie
    Rilliard, Albert
    d'Alessandro, Christophe
    Adda-Decker, Martine
    [J]. LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 1003 - 1010