Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech

Authors
Wang, Shijun [1]
Gudnason, Jon [2]
Borth, Damian [1]
Affiliations
[1] Univ St Gallen, St Gallen, Switzerland
[2] Reykjavik Univ, Reykjavik, Iceland
Source
Keywords
emotional representation learning; speech emotion recognition; emotional TTS
DOI
10.21437/Interspeech.2023-1595
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification code
070206; 082403
Abstract
Effective speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks. However, emotional speech samples are more difficult and expensive to acquire compared with Neutral style speech, which causes one issue that most related works unfortunately neglect: imbalanced datasets. Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations. In this paper, we propose an Emotion Extractor to address this issue. We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets. Our empirical results show that (1) for the SER task, the proposed Emotion Extractor surpasses the state-of-the-art baseline on three imbalanced datasets; (2) the produced representations from our Emotion Extractor benefit the TTS model, and enable it to synthesize more expressive speech.
Pages: 351-355
Page count: 5