Deep temporal clustering features for speech emotion recognition

Cited: 1
Authors
Lin, Wei-Cheng [1 ]
Busso, Carlos [1 ]
Affiliation
[1] Univ Texas Dallas, Dept Elect & Comp Engn, 800 W Campbell Rd, Richardson, TX 75080 USA
Funding
U.S. National Science Foundation
Keywords
Deep clustering; Temporal modeling; Semi-supervised learning; Speech emotion recognition; AUTOENCODERS; CORPUS;
DOI
10.1016/j.specom.2023.103027
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Deep clustering is a popular unsupervised technique for feature representation learning. We recently proposed the chunk-based DeepEmoCluster framework for speech emotion recognition (SER), adopting the concept of deep clustering as a novel semi-supervised learning (SSL) framework that achieved better recognition performance than conventional reconstruction-based approaches. However, the vanilla DeepEmoCluster lacks critical sentence-level temporal information that is useful for SER tasks. This study builds upon the DeepEmoCluster framework, creating a powerful SSL approach that leverages temporal information within a sentence. We propose two sentence-level temporal modeling alternatives, using either the temporal-net or the triplet loss function, resulting in a novel temporal-enhanced DeepEmoCluster framework that captures essential temporal information. The key contribution to achieving this goal is the proposed sentence-level uniform sampling strategy, which preserves the original temporal order of the data for the clustering process. For the temporal-net option, an extra network module (e.g., a gated recurrent unit) encodes temporal information across the data chunks. Alternatively, we can impose additional temporal constraints by using the triplet loss function while training the DeepEmoCluster framework, which does not increase model complexity. Our experimental results on the MSP-Podcast corpus demonstrate that the proposed temporal-enhanced framework significantly outperforms the vanilla DeepEmoCluster framework and other existing SSL approaches in regression tasks for the emotional attributes arousal, dominance, and valence. The improvements are observed in both fully supervised and SSL implementations. Further analyses validate the effectiveness of the proposed temporal modeling, showing (1) high temporal consistency in the cluster assignments, and (2) well-separated emotional patterns in the generated clusters.
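To make the two temporal-modeling alternatives described above concrete, the following Python (PyTorch) sketch illustrates (a) a sentence-level uniform chunk-sampling routine that keeps chunks in their original temporal order, (b) a GRU-based temporal-net that aggregates the ordered chunk embeddings into a sentence-level prediction, and (c) a triplet loss over chunk embeddings that adds a temporal constraint without extra model parameters. The chunk count, chunk length, feature dimensions, margin, and helper names are illustrative assumptions, not values from the paper; this is a minimal sketch of the general idea, not the authors' implementation.

# Minimal illustrative sketch (not the authors' code); shapes and
# hyperparameters below are assumptions chosen only for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def uniform_chunk_sampling(features, num_chunks=11, chunk_len=100):
    # Split a sentence-level feature matrix [T, D] into num_chunks equally
    # spaced, fixed-length chunks, preserving their temporal order.
    T = features.size(0)
    starts = torch.linspace(0, max(T - chunk_len, 0), num_chunks).long()
    return torch.stack([features[s:s + chunk_len] for s in starts])  # [num_chunks, chunk_len, D]

class TemporalNet(nn.Module):
    # Temporal-net option: a GRU encodes the ordered sequence of chunk
    # embeddings into one sentence-level vector used for attribute regression.
    def __init__(self, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # e.g., an arousal score

    def forward(self, chunk_embeddings):       # [batch, num_chunks, emb_dim]
        _, h = self.gru(chunk_embeddings)      # h: [1, batch, hidden_dim]
        return self.head(h.squeeze(0))         # [batch, 1]

def temporal_triplet_loss(anchor_emb, positive_emb, negative_emb, margin=1.0):
    # Triplet-loss option: pull embeddings of chunks from the same sentence
    # together and push an embedding from a different sentence away, adding
    # a sentence-level constraint without increasing model complexity.
    return F.triplet_margin_loss(anchor_emb, positive_emb, negative_emb, margin=margin)

# Example usage with random data:
feats = torch.randn(937, 128)                  # [T, D] frame-level features of one sentence
chunks = uniform_chunk_sampling(feats)         # [11, 100, 128]
chunk_emb = chunks.mean(dim=1)                 # placeholder chunk encoder -> [11, 128]
net = TemporalNet(emb_dim=128)
pred = net(chunk_emb.unsqueeze(0))             # sentence-level prediction, [1, 1]
neg = torch.randn(1, 128)                      # placeholder chunk embedding from another sentence
loss = temporal_triplet_loss(chunk_emb[0:1], chunk_emb[1:2], neg)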
Pages: 12