Modulation spectral features for speech emotion recognition using deep neural networks

Cited by: 18
Authors
Singh, Premjeet [1]
Sahidullah, Md [2]
Saha, Goutam [1]
Affiliations
[1] Indian Inst Technol Kharagpur, Dept Elect & Elect Commun Engn, Kharagpur 721302, India
[2] Univ Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Keywords
Constant-Q transform; Convolutional neural network; Modulation spectrogram; Gammatone spectrogram; Shift invariance; Speech emotion recognition; REPRESENTATIONS; MUSIC; PURSUIT; PROSODY
DOI
10.1016/j.specom.2022.11.005
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). Human perception and analysis of sound comprises two important cognitive parts: early auditory analysis and cortex-based processing. Early auditory analysis considers a spectrogram-based representation, whereas cortex-based analysis includes extraction of temporal modulations from the spectrogram. This temporal modulation representation of the spectrogram is called the modulation spectral feature (MSF). As the constant-Q transform (CQT) provides higher resolution in the emotion-salient low-frequency regions of speech, we find that the CQT-based spectrogram, together with its temporal modulations, provides a representation enriched with emotion-specific information. We argue that CQT-MSF, when used with a 2-dimensional convolutional network, can provide a time-shift invariant and deformation-insensitive representation for SER. Our results show that CQT-MSF outperforms the standard mel-scale based spectrogram and its modulation features on two popular SER databases, Berlin EmoDB and RAVDESS. We also show that our proposed feature outperforms the shift- and deformation-invariant scattering transform coefficients, which shows the importance of joint hand-crafted and self-learned feature extraction over reliance on completely hand-crafted features. Finally, we perform Grad-CAM analysis to visually inspect the contribution of constant-Q modulation features to SER.
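The idea of extracting temporal modulations from a spectrogram can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not specify segment lengths, windowing, or normalization, so the mean removal and Hann taper below are assumptions, the function name `modulation_spectrum` is hypothetical, and a synthetic array stands in for a real CQT spectrogram. The sketch only shows the core step, a Fourier transform along the time axis of each frequency channel.

```python
import numpy as np


def modulation_spectrum(spectrogram, frame_rate):
    """Temporal modulation spectrum of a (n_bins, n_frames) magnitude spectrogram.

    Hypothetical helper for illustration only.
    Returns (mod_freqs, mod_spec) with mod_spec of shape (n_bins, n_frames // 2 + 1).
    """
    spec = np.asarray(spectrogram, dtype=float)
    # Assumption: remove each channel's mean so the DC term does not dominate.
    centered = spec - spec.mean(axis=1, keepdims=True)
    # Assumption: Hann taper along time to limit spectral leakage.
    window = np.hanning(spec.shape[1])
    # FFT along the time axis gives the modulation content of each channel.
    mod_spec = np.abs(np.fft.rfft(centered * window, axis=1))
    # Modulation frequencies in Hz, given the spectrogram frame rate.
    mod_freqs = np.fft.rfftfreq(spec.shape[1], d=1.0 / frame_rate)
    return mod_freqs, mod_spec


# Toy stand-in for a spectrogram: 3 channels, 100 frames per second, 2 seconds.
frame_rate = 100.0
t = np.arange(200) / frame_rate
toy = np.ones((3, t.size))
# Channel 1 has its envelope modulated at 4 Hz; the other channels are flat.
toy[1] += 0.5 * np.sin(2 * np.pi * 4.0 * t)

mod_freqs, mod_spec = modulation_spectrum(toy, frame_rate)
peak_hz = mod_freqs[np.argmax(mod_spec[1])]
print(f"strongest modulation in channel 1: {peak_hz} Hz")
```

In this toy case the modulated channel peaks at its 4 Hz envelope rate while the flat channels carry no modulation energy. With real speech, the input would be a CQT magnitude spectrogram, and the resulting two-dimensional array (acoustic frequency by modulation frequency) is the kind of representation a 2-dimensional convolutional network could take as input.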
Pages: 53-69
Number of pages: 17