Generative emotional AI for speech emotion recognition: The case for synthetic emotional speech augmentation

Times cited: 7
Authors
Latif, Siddique [1]
Shahid, Abdullah [2]
Qadir, Junaid [3]
Affiliations
[1] Queensland University of Technology, Brisbane, Australia
[2] Information Technology University (ITU), Lahore, Punjab, Pakistan
[3] Qatar University, Doha, Qatar
Keywords
Tacotron; WaveRNN; Speech synthesis; Text-to-speech; Emotional speech synthesis; Speech emotion recognition
DOI
10.1016/j.apacoust.2023.109425
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Despite advances in deep learning, current state-of-the-art speech emotion recognition (SER) systems still perform poorly due to a lack of emotional speech datasets. This paper proposes augmenting SER training data with synthetic emotional speech generated by an end-to-end text-to-speech (TTS) system based on an extended Tacotron 2 architecture. The proposed TTS system comprises encoders for speaker and emotion embeddings, a sequence-to-sequence generator that produces Mel-spectrograms from text, and a WaveRNN vocoder that synthesizes audio from the Mel-spectrograms. Extensive experiments show that the generated emotional speech achieves a higher mean opinion score (MOS) than the baseline, and that using the generated samples for data augmentation significantly improves SER performance on multiple datasets.
(c) 2023 Elsevier Ltd. All rights reserved.
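The abstract names two mechanisms: conditioning a Tacotron 2-style text encoder on speaker and emotion embeddings, and adding the synthesized utterances to the SER training set. The sketch below illustrates both under stated assumptions; it is not the authors' implementation. It assumes concatenation-based conditioning (the abstract does not say how the embeddings are combined) and a simple ratio-based mixing rule for augmentation. The function names `condition_encoder_outputs` and `augment_training_set` and the parameter `synth_ratio` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[str, str]  # (utterance_id, emotion_label)


def condition_encoder_outputs(
    text_states: Sequence[Vector],  # one encoder vector per input character/phoneme
    speaker_emb: Vector,            # fixed-size speaker embedding
    emotion_emb: Vector,            # fixed-size emotion embedding
) -> List[Vector]:
    """Append the same speaker and emotion vectors to every encoder time step.

    Assumption: concatenation-based conditioning, a common choice in
    multi-speaker Tacotron variants; the paper's exact mechanism may differ.
    """
    return [list(state) + speaker_emb + emotion_emb for state in text_states]


def augment_training_set(
    real: Sequence[Example],
    synthetic: Sequence[Example],
    synth_ratio: float,  # hypothetical knob: synthetic examples per real example
    seed: int = 0,
) -> List[Example]:
    """Return all real examples plus up to synth_ratio * len(real) synthetic ones, shuffled."""
    rng = random.Random(seed)
    n_synth = min(len(synthetic), int(synth_ratio * len(real)))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    states = [[0.1, 0.2], [0.3, 0.4]]
    print(condition_encoder_outputs(states, [1.0], [0.0, 1.0]))
    real = [("r1", "happy"), ("r2", "sad"), ("r3", "angry"), ("r4", "neutral")]
    synth = [("s1", "happy"), ("s2", "sad"), ("s3", "angry")]
    print(augment_training_set(real, synth, synth_ratio=0.5))
```

Broadcasting one embedding across all time steps lets the decoder's attention see speaker and emotion identity at every step. Capping synthetic data at a ratio of the real set is one simple way to keep synthetic samples from dominating training; the paper's actual mixing strategy may differ.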
Pages: 10