End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model

Cited by: 23
Authors:
Feng, Han [1 ]
Ueno, Sei [1 ]
Kawahara, Tatsuya [1 ]
Affiliations:
[1] Kyoto Univ, Grad Sch Informat, Sakyo Ku, Kyoto, Japan
Keywords:
speech emotion recognition; acoustic-to-word speech recognition; end-to-end; self-attention mechanism; multi-task learning
DOI:
10.21437/Interspeech.2020-1180
Chinese Library Classification (CLC):
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes:
100104; 100213
Abstract:
In this paper, we propose speech emotion recognition (SER) combined with an acoustic-to-word automatic speech recognition (ASR) model. While acoustic prosodic features are primarily used for SER, textual features are also useful but error-prone, especially for emotional speech. To address this problem, we integrate the ASR and SER models in an end-to-end manner using an acoustic-to-word model: the decoder states of the ASR model are combined with the acoustic features and fed into the SER model. On top of a recurrent network that learns features from this input, we adopt a self-attention mechanism to focus on important feature frames. Finally, we fine-tune the ASR model on the new dataset with multi-task learning to jointly optimize ASR with the SER task. Our model achieves 68.63% weighted accuracy (WA) and 69.67% unweighted accuracy (UA) on the IEMOCAP database, which is state-of-the-art performance.
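The pipeline the abstract describes — concatenating per-frame ASR decoder states with acoustic features, then pooling frames with self-attention before emotion classification — can be sketched as follows. This is an illustrative outline with random placeholder features and untrained weights, not the authors' code; all dimensions, variable names, and the single-head attention form are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame inputs for one utterance:
# 50 frames of 40-dim acoustic features and 64-dim ASR decoder states.
T, d_acoustic, d_decoder = 50, 40, 64
acoustic = rng.standard_normal((T, d_acoustic))
decoder_states = rng.standard_normal((T, d_decoder))

# 1) Fuse: concatenate acoustic features with ASR decoder states per frame.
fused = np.concatenate([acoustic, decoder_states], axis=1)  # shape (T, 104)

# 2) Self-attention pooling: score each frame, softmax over frames,
#    then take the attention-weighted sum as the utterance vector.
w = rng.standard_normal(fused.shape[1])   # attention parameter (untrained)
scores = fused @ w                        # one score per frame, shape (T,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                      # attention weights, sum to 1
utterance_vec = alpha @ fused             # shape (104,)

# 3) Linear classifier over the emotion classes (4 on IEMOCAP).
W_cls = rng.standard_normal((fused.shape[1], 4))
logits = utterance_vec @ W_cls
pred = int(np.argmax(logits))             # predicted emotion class index
```

In the paper the fused frames first pass through a recurrent network before attention, and the whole stack is trained jointly with the ASR loss; here those steps are elided to isolate the fuse-attend-classify flow.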
Pages: 501-505 (5 pages)