Temporal Context in Speech Emotion Recognition

Cited by: 10
Authors
Xia, Yangyang [1]
Chen, Li-Wei [2]
Rudnicky, Alexander [2]
Stern, Richard M. [1, 2]
Affiliations
[1] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
Source
INTERSPEECH 2021
Funding
Andrew W. Mellon Foundation, USA
Keywords
speech emotion recognition; deep neural networks; prosodic features; wav2vec; learnable spectro-temporal receptive fields; representation
DOI
10.21437/Interspeech.2021-1840
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
We investigate the importance of temporal context for speech emotion recognition (SER). Two SER systems trained on traditional and learned features, respectively, are developed to predict categorical labels of emotion. For traditional acoustical features, we study the combination of filterbank features and prosodic features and the impact on SER when the temporal context of these features is expanded by learnable spectro-temporal receptive fields (STRFs). Experiments show that the system trained on learnable STRFs outperforms other reported systems evaluated with a similar setup. We also demonstrate that the wav2vec features, pretrained with long temporal context, are superior to traditional features. We then introduce a novel segment-based learning objective to constrain our classifier to extract local emotion features from the large temporal context. Combined with the learning objective and fine-tuning strategy, our top-line system using wav2vec features reaches state-of-the-art performance on the IEMOCAP dataset.
Pages: 3370-3374
Page count: 5
Related papers
50 in total
  • [1] Speech emotion recognition in acted and spontaneous context
    Chenchah, Farah
    Lachiri, Zied
    [J]. 6TH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN COMPUTER INTERACTION, IHCI 2014, 2014, 39 : 139 - 145
  • [2] RNN with Improved Temporal Modeling for Speech Emotion Recognition
    Lieskovska, Eva
    Jakubec, Maros
    Jarina, Roman
    [J]. 2022 32ND INTERNATIONAL CONFERENCE RADIOELEKTRONIKA (RADIOELEKTRONIKA), 2022, : 5 - 9
  • [3] Temporal Discrete Cosine Transform for Speech Emotion Recognition
    Popovic, Branislav
    Stankovic, Igor
    Ostrogonac, Stevan
    [J]. 2013 IEEE 4TH INTERNATIONAL CONFERENCE ON COGNITIVE INFOCOMMUNICATIONS (COGINFOCOM), 2013, : 87 - 90
  • [4] Towards Temporal Modelling of Categorical Speech Emotion Recognition
    Han, Wenjing
    Ruan, Huabin
    Chen, Xiaomin
    Wang, Zhixiang
    Li, Haifeng
    Schuller, Bjoern
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 932 - 936
  • [5] Deep temporal clustering features for speech emotion recognition
    Lin, Wei-Cheng
    Busso, Carlos
    [J]. SPEECH COMMUNICATION, 2024, 157
  • [6] SPATIO-TEMPORAL CONTEXT MODELLING FOR SPEECH EMOTION CLASSIFICATION
    Jalal, Md Asif
    Moore, Roger K.
    Hain, Thomas
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 853 - 859
  • [7] CONTEXT-AWARE ATTENTION MECHANISM FOR SPEECH EMOTION RECOGNITION
    Ramet, Gaetan
    Garner, Philip N.
    Baeriswyl, Michael
    Lazaridis, Alexandros
    [J]. 2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 126 - 131
  • [8] Speech emotion recognition with embedded attention mechanism and hierarchical context
    Cheng, Yanfen
    Chen, Yaoxin
    Chen, Yiling
    Yang, Yi
    [J]. Harbin Gongye Daxue Xuebao/Journal of Harbin Institute of Technology, 2019, 51 (11) : 100 - 107
  • [9] A Study on Speech Emotion Recognition in the Context of Voice User Experience
    Demaeght, Annebeth
    Nerb, Josef
    Mueller, Andrea
    [J]. HCI IN BUSINESS, GOVERNMENT AND ORGANIZATIONS, PT II, HCIBGO 2024, 2024, 14721 : 174 - 188
  • [10] Temporal Relation Inference Network for Multimodal Speech Emotion Recognition
    Dong, Guan-Nan
    Pun, Chi-Man
    Zhang, Zheng
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (09) : 6472 - 6485