Modulation spectral features for speech emotion recognition using deep neural networks

Cited by: 18
Authors
Singh, Premjeet [1]
Sahidullah, Md [2]
Saha, Goutam [1]
Affiliations
[1] Indian Inst Technol Kharagpur, Dept Elect & Elect Commun Engn, Kharagpur 721302, India
[2] Univ Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Keywords
Constant-Q transform; Convolutional neural network; Modulation spectrogram; Gammatone spectrogram; Shift invariance; Speech emotion recognition; REPRESENTATIONS; MUSIC; PURSUIT; PROSODY;
DOI
10.1016/j.specom.2022.11.005
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). Human perception and analysis of sound comprise two important cognitive parts: early auditory analysis and cortex-based processing. Early auditory analysis considers a spectrogram-based representation, whereas cortex-based analysis includes the extraction of temporal modulations from the spectrogram. This temporal modulation representation of the spectrogram is called the modulation spectral feature (MSF). As the constant-Q transform (CQT) provides higher resolution in the emotion-salient low-frequency regions of speech, we find that the CQT-based spectrogram, together with its temporal modulations, provides a representation enriched with emotion-specific information. We argue that CQT-MSF, when used with a 2-dimensional convolutional network, can provide a time-shift invariant and deformation-insensitive representation for SER. Our results show that CQT-MSF outperforms the standard mel-scale based spectrogram and its modulation features on two popular SER databases, Berlin EmoDB and RAVDESS. We also show that our proposed feature outperforms the shift- and deformation-invariant scattering transform coefficients, which shows the importance of joint hand-crafted and self-learned feature extraction over reliance on completely hand-crafted features. Finally, we perform Grad-CAM analysis to visually inspect the contribution of constant-Q modulation features to SER.
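The idea of taking temporal modulations from a spectrogram can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes a plain STFT magnitude spectrogram as a stand-in for the CQT spectrogram, and it computes the modulation spectrum as an FFT along time of each frequency channel's envelope. The function names, frame length, hop size, and toy signal are all hypothetical choices made for illustration.

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=256, hop=64):
    """STFT magnitude spectrogram (assumed stand-in for a CQT spectrogram)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Transpose so rows are frequency bins and columns are time frames.
    return np.abs(np.fft.rfft(frames, axis=1)).T

def modulation_spectrum(spec, frame_rate):
    """FFT along time of each channel's mean-removed envelope."""
    centered = spec - spec.mean(axis=1, keepdims=True)
    mod = np.abs(np.fft.rfft(centered, axis=1))
    mod_freqs = np.fft.rfftfreq(spec.shape[1], d=1.0 / frame_rate)
    return mod, mod_freqs

# Toy signal (illustrative only): a 1 kHz carrier amplitude-modulated at 4 Hz.
sr = 8000
t = np.arange(2 * sr) / sr
x = (1.0 + 0.8 * np.sin(2 * np.pi * 4.0 * t)) * np.sin(2 * np.pi * 1000.0 * t)

hop = 64
spec = magnitude_spectrogram(x, frame_len=256, hop=hop)
mod, mod_freqs = modulation_spectrum(spec, frame_rate=sr / hop)

# The channel holding the carrier should show its strongest modulation near 4 Hz.
carrier_bin = int(round(1000.0 * 256 / sr))
peak_mod_freq = mod_freqs[np.argmax(mod[carrier_bin])]
print(f"peak modulation frequency: {peak_mod_freq:.2f} Hz")
```

The result is a two-dimensional array indexed by acoustic frequency and modulation frequency. In the setting the abstract describes, a representation of this kind would be passed to a 2-dimensional convolutional network alongside the spectrogram itself.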
Pages: 53-69
Number of pages: 17
Related papers (50 in total)
  • [1] Dimensional Emotion Recognition from Speech Using Modulation Spectral Features and Recurrent Neural Networks
    Peng, Zhichao
    Zhu, Zhi
    Unoki, Masashi
    Dang, Jianwu
    Akagi, Masato
    [J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 524 - 528
  • [2] Automatic speech emotion recognition using modulation spectral features
    Wu, Siqing
    Falk, Tiago H.
    Chan, Wai-Yip
    [J]. SPEECH COMMUNICATION, 2011, 53 (05) : 768 - 785
  • [3] Emotion recognition from speech using deep recurrent neural networks with acoustic features
    Byun, Sung-Woo
    Shin, Bo-Ra
    Lee, Seok-Pil
    Han, Hyuk-Soo
    [J]. BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY, 2018, 123 : 43 - 44
  • [4] Speech Emotion Recognition on Mobile Devices Based on Modulation Spectral Feature Pooling and Deep Neural Networks
    Avila, Anderson R.
    Monteiro, Joao
    O'Shaughnessy, Douglas
    Falk, Tiago H.
    [J]. 2017 IEEE INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY (ISSPIT), 2017, : 360 - 365
  • [5] Urdu Speech Emotion Recognition using Speech Spectral Features and Deep Learning Techniques
    Taj, Soonh
    Shaikh, Ghulam Mujtaba
    Hassan, Saif
    Nimra
    [J]. 2023 4th International Conference on Computing, Mathematics and Engineering Technologies: Sustainable Technologies for Socio-Economic Development, iCoMET 2023, 2023,
  • [6] Speech emotion recognition with deep convolutional neural networks
    Issa, Dias
    Demirci, M. Fatih
    Yazici, Adnan
    [J]. BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2020, 59 (59)
  • [7] Speech Emotion Recognition using Convolution Neural Networks and Deep Stride Convolutional Neural Networks
    Wani, Taiba Majid
    Gunawan, Teddy Surya
    Qadri, Syed Asif Ahmad
    Mansor, Hasmah
    Kartiwi, Mira
    Ismail, Nanang
    [J]. PROCEEDING OF 2020 6TH INTERNATIONAL CONFERENCE ON WIRELESS AND TELEMATICS (ICWT), 2020,
  • [8] Emotion recognition in speech using neural networks
    Nicholson, J
    Takahashi, K
    Nakatsu, R
    [J]. AFFECTIVE MINDS, 2000, : 215 - 220
  • [9] Emotion recognition in speech using neural networks
    Nicholson, J
    Takahashi, K
    Nakatsu, R
    [J]. NEURAL COMPUTING & APPLICATIONS, 2000, 9 (04) : 290 - 296