Temporal Context in Speech Emotion Recognition

Cited by: 10
Authors
Xia, Yangyang [1]
Chen, Li-Wei [2]
Rudnicky, Alexander [2]
Stern, Richard M. [1, 2]
Affiliations
[1] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
Funding
Andrew W. Mellon Foundation (USA)
Keywords
speech emotion recognition; deep neural networks; prosodic features; wav2vec; learnable spectro-temporal receptive fields; REPRESENTATION;
DOI
10.21437/Interspeech.2021-1840
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
We investigate the importance of temporal context for speech emotion recognition (SER). Two SER systems trained on traditional and learned features, respectively, are developed to predict categorical labels of emotion. For traditional acoustical features, we study the combination of filterbank features and prosodic features and the impact on SER when the temporal context of these features is expanded by learnable spectro-temporal receptive fields (STRFs). Experiments show that the system trained on learnable STRFs outperforms other reported systems evaluated with a similar setup. We also demonstrate that the wav2vec features, pretrained with long temporal context, are superior to traditional features. We then introduce a novel segment-based learning objective to constrain our classifier to extract local emotion features from the large temporal context. Combined with the learning objective and fine-tuning strategy, our top-line system using wav2vec features reaches state-of-the-art performance on the IEMOCAP dataset.
Pages: 3370-3374
Number of pages: 5
Related papers
50 in total
  • [31] Persian Speech Emotion Recognition
    Savargiv, Mohammad
    Bastanfard, Azam
    [J]. 2015 7TH CONFERENCE ON INFORMATION AND KNOWLEDGE TECHNOLOGY (IKT), 2015,
  • [32] Windowing for Speech Emotion Recognition
    Puterka, Boris
    Kacur, Juraj
    Pavlovicova, Jarmila
    [J]. 2019 61ST INTERNATIONAL SYMPOSIUM ELMAR, 2019, : 147 - 150
  • [33] Mandarin emotion recognition in speech
    Pao, TL
    Chen, YT
    [J]. ASRU'03: 2003 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING ASRU '03, 2003, : 227 - 230
  • [34] Multiroom Speech Emotion Recognition
    Shalev, Erez
    Cohen, Israel
    [J]. 2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 135 - 139
  • [35] Emotion recognition in Arabic speech
    Samira Klaylat
    Ziad Osman
    Lama Hamandi
    Rached Zantout
    [J]. Analog Integrated Circuits and Signal Processing, 2018, 96 : 337 - 351
  • [36] Learning Salient Segments for Speech Emotion Recognition Using Attentive Temporal Pooling
    Xia, Xiaohan
    Jiang, Dongmei
    Sahli, Hichem
    [J]. IEEE ACCESS, 2020, 8 : 151740 - 151752
  • [37] Attention-enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition
    Zhao, Ziping
    Bao, Zhongtian
    Zhang, Zixing
    Cummins, Nicholas
    Wang, Haishuai
    Schuller, Bjorn W.
    [J]. INTERSPEECH 2019, 2019, : 206 - 210
  • [38] Progress in speech emotion recognition
    Zhang, Xueying
    Sun, Ying
    Duan, Shufei
    [J]. TENCON 2015 - 2015 IEEE REGION 10 CONFERENCE, 2015,
  • [39] Emotion recognition in Arabic speech
    Hadjadji, Imene
    Falek, Leila
    Demri, Lyes
    Teffahi, Hocine
    [J]. 2019 INTERNATIONAL CONFERENCE ON ADVANCED ELECTRICAL ENGINEERING (ICAEE), 2019,
  • [40] Emotion Recognition from Human Speech Using Temporal Information and Deep Learning
    Kim, John W.
    Saurous, Rif A.
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 937 - 940