Generative emotional AI for speech emotion recognition: The case for synthetic emotional speech augmentation

Cited by: 7
Authors:
Latif, Siddique [1 ]
Shahid, Abdullah [2 ]
Qadir, Junaid [3 ]
Affiliations:
[1] Queensland Univ Technol, Brisbane, Australia
[2] Informat Technol Univ ITU, Lahore, Punjab, Pakistan
[3] Qatar Univ, Doha, Qatar
Keywords:
Tacotron; WaveRNN; Speech synthesis; Text-to-speech; Emotional speech synthesis; Speech emotion recognition;
DOI:
10.1016/j.apacoust.2023.109425
CLC classification: O42 [Acoustics]
Subject classification codes: 070206; 082403
Abstract:
Despite advances in deep learning, current state-of-the-art speech emotion recognition (SER) systems still perform poorly because of the scarcity of emotional speech datasets. This paper proposes augmenting SER systems with synthetic emotional speech generated by an end-to-end text-to-speech (TTS) system based on an extended Tacotron 2 architecture. The proposed TTS system includes encoders for speaker and emotion embeddings, a sequence-to-sequence generator that converts text into Mel-spectrograms, and a WaveRNN vocoder that generates audio from the Mel-spectrograms. Extensive experiments show that the proposed system generates emotional speech of higher quality than the baseline, as measured by mean opinion score (MOS), and that the generated samples are effective for augmenting SER performance on multiple datasets. © 2023 Elsevier Ltd. All rights reserved.
Pages: 10
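To make the pipeline described in the abstract concrete, the sketch below is a minimal, illustrative PyTorch example, not the authors' implementation: a toy Tacotron 2-style sequence-to-sequence model conditioned on speaker and emotion embeddings emits Mel-spectrograms, a placeholder function stands in for the WaveRNN vocoder, and the resulting synthetic samples are concatenated onto a stand-in "real" SER training set. All class names, layer sizes, and the simplified decoder are assumptions made purely for illustration.

import torch
import torch.nn as nn

class ConditionedTacotronSketch(nn.Module):
    """Toy stand-in for the extended Tacotron 2: a character encoder plus
    speaker/emotion embeddings feeding a simplified Mel-spectrogram decoder."""

    def __init__(self, n_chars=64, n_speakers=10, n_emotions=5,
                 d_text=128, d_cond=32, n_mels=80):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_text)
        self.text_enc = nn.GRU(d_text, d_text, batch_first=True)
        self.spk_emb = nn.Embedding(n_speakers, d_cond)   # speaker embedding encoder (simplified)
        self.emo_emb = nn.Embedding(n_emotions, d_cond)   # emotion embedding encoder (simplified)
        self.decoder = nn.GRU(d_text + 2 * d_cond, 256, batch_first=True)
        self.mel_proj = nn.Linear(256, n_mels)

    def forward(self, char_ids, speaker_id, emotion_id):
        enc, _ = self.text_enc(self.char_emb(char_ids))              # encode the character sequence
        cond = torch.cat([self.spk_emb(speaker_id),
                          self.emo_emb(emotion_id)], dim=-1)         # speaker + emotion condition
        cond = cond.unsqueeze(1).expand(-1, enc.size(1), -1)         # broadcast over time steps
        dec, _ = self.decoder(torch.cat([enc, cond], dim=-1))
        return self.mel_proj(dec)                                    # (batch, time, n_mels)

def vocode(mel):
    """Placeholder for the WaveRNN vocoder: flattens Mel frames into a 1-D 'signal'."""
    return mel.reshape(mel.size(0), -1)

if __name__ == "__main__":
    tts = ConditionedTacotronSketch()
    chars = torch.randint(0, 64, (4, 20))                # four toy "utterances"
    mels = tts(chars, torch.tensor([0, 1, 2, 3]), torch.tensor([0, 1, 2, 3]))
    synthetic_audio = vocode(mels).detach()
    synthetic_labels = torch.tensor([0, 1, 2, 3])        # emotion labels carried over from conditioning
    # Augmentation step: append the synthetic samples to a (random, stand-in) real SER set.
    real_audio = torch.randn(8, synthetic_audio.size(1))
    real_labels = torch.randint(0, 5, (8,))
    aug_audio = torch.cat([real_audio, synthetic_audio], dim=0)
    aug_labels = torch.cat([real_labels, synthetic_labels], dim=0)
    print(aug_audio.shape, aug_labels.shape)             # augmented SER training set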
Related papers (50 in total):
  • [1] Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech
    Wang, Shijun
    Gudnason, Jon
    Borth, Damian
    INTERSPEECH 2023, 2023: 351-355
  • [2] Application of Emotion Recognition and Modification for Emotional Telugu Speech Recognition
    Vegesna, Vishnu Vidyadhara Raju
    Gurugubelli, Krishna
    Vuppala, Anil Kumar
    Mobile Networks and Applications, 2019, 24(1): 193-201
  • [3] Building a Recognition System of Speech Emotion and Emotional States
    Feng, Xiaoyan
    Watada, Junzo
    2013 Second International Conference on Robot, Vision and Signal Processing (RVSP), 2013: 253-258
  • [4] Emotion Attribute Projection for Speaker Recognition on Emotional Speech
    Bao, Huanjun
    Xu, Mingxing
    Zheng, Thomas Fang
    INTERSPEECH 2007: 8th Annual Conference of the International Speech Communication Association, Vols 1-4, 2007: 601-604
  • [5] StarGAN for Emotional Speech Conversion: Validated by Data Augmentation of End-to-End Emotion Recognition
    Rizos, Georgios
    Baird, Alice
    Elliott, Max
    Schuller, Bjorn
    2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020: 3502-3506
  • [6] Assessment of spontaneous emotional speech database toward emotion recognition: Intensity and similarity of perceived emotion from spontaneously expressed emotional speech
    Arimoto, Yoshiko
    Ohno, Sumio
    Iida, Hitoshi
    Acoustical Science and Technology, 2011, 32(1): 26-29
  • [7] Generative Data Augmentation Guided by Triplet Loss for Speech Emotion Recognition
    Wang, Shijun
    Hemati, Hamed
    Gudnason, Jon
    Borth, Damian
    INTERSPEECH 2022, 2022: 391-395
  • [8] Prominence features: Effective emotional features for speech emotion recognition
    Jing, Shaoling
    Mao, Xia
    Chen, Lijiang
    Digital Signal Processing, 2018, 72: 216-231
  • [9] Speech Emotion Recognition Based on Gender Influence in Emotional Expression
    Vasuki, P.
    Bharati, Divya R.
    International Journal of Intelligent Information Technologies, 2019, 15(4): 22-40