Modulation spectral features for speech emotion recognition using deep neural networks

被引:18
|
作者
Singh, Premjeet [1 ]
Sahidullah, Md [2 ]
Saha, Goutam [1 ]
机构
[1] Indian Inst Technol Kharagpur, Dept Elect & Elect Commun Engn, Kharagpur 721302, India
[2] Univ Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
关键词
Constant-Q transform; Convolutional neural network; Modulation spectrogram; Gammatone spectrogram; Shift invariance; Speech emotion recognition; REPRESENTATIONS; MUSIC; PURSUIT; PROSODY;
D O I
10.1016/j.specom.2022.11.005
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). The human perception and analysis of sound comprise of two important cognitive parts: early auditory analysis and cortex-based processing. The early auditory analysis considers spectrogram-based representation whereas cortex-based analysis includes extraction of temporal modulations from the spectrogram. This temporal modulation representation of spectrogram is called modulation spectral feature (MSF). As the constant-Q transform (CQT) provides higher resolution at emotion salient low -frequency regions of speech, we find that CQT-based spectrogram, together with its temporal modulations, provides a representation enriched with emotion-specific information. We argue that CQT-MSF when used with a 2-dimensional convolutional network can provide a time-shift invariant and deformation insensitive representation for SER. Our results show that CQT-MSF outperforms standard mel-scale based spectrogram and its modulation features on two popular SER databases, Berlin EmoDB and RAVDESS. We also show that our proposed feature outperforms the shift and deformation invariant scattering transform coefficients, hence, showing the importance of joint hand-crafted and self-learned feature extraction instead of reliance on complete hand-crafted features. Finally, we perform Grad-CAM analysis to visually inspect the contribution of constant-Q modulation features over SER.
引用
收藏
页码:53 / 69
页数:17
相关论文
共 50 条
  • [41] Amplitude Modulation Features for Emotion Recognition from Speech
    Alam, Md Jahangir
    Attabi, Yazid
    Dumouchel, Pierre
    Kenny, Patrick
    O'Shaughnessy, D.
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 2419 - 2423
  • [42] Active Learning for Speech Emotion Recognition Using Deep Neural Network
    Abdelwahab, Mohammed
    Busso, Carlos
    [J]. 2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [43] Modulation Recognition Using Hierarchical Deep Neural Networks
    Karra, Krishna
    Kuzdeba, Scott
    Petersen, Josh
    [J]. 2017 IEEE INTERNATIONAL SYMPOSIUM ON DYNAMIC SPECTRUM ACCESS NETWORKS (IEEE DYSPAN), 2017,
  • [44] ARTICULATORY FEATURES FROM DEEP NEURAL NETWORKS AND THEIR ROLE IN SPEECH RECOGNITION
    Mitra, Vikramjit
    Sivaraman, Ganesh
    Nam, Hosung
    Espy-Wilson, Carol
    Saltzman, Elliot
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [45] EFFICIENT ARABIC EMOTION RECOGNITION USING DEEP NEURAL NETWORKS
    Hifny, Yasser
    Ali, Ahmed
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6710 - 6714
  • [46] Human Facial Emotion Recognition using Deep Neural Networks
    Benisha, S.
    Mirnalinee, T. T.
    [J]. INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2023, 20 (03) : 303 - 309
  • [47] Music emotion recognition using deep convolutional neural networks
    Li, Ting
    [J]. JOURNAL OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING, 2024, 24 (4-5) : 3063 - 3078
  • [48] Emotion Recognition from Semi Natural Speech Using Artificial Neural Networks and Excitation Source Features
    Koolagudi, Shashidhar G.
    Devliyal, Swati
    Barthwal, Anurag
    Rao, K. Sreenivasa
    [J]. CONTEMPORARY COMPUTING, 2012, 306 : 273 - +
  • [49] A Spectral Masking Approach to Noise-Robust Speech Recognition Using Deep Neural Networks
    Li, Bo
    Sim, Khe Chai
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2014, 22 (08) : 1296 - 1305
  • [50] Pattern recognition and features selection for speech emotion recognition model using deep learning
    Jermsittiparsert, Kittisak
    Abdurrahman, Abdurrahman
    Siriattakul, Parinya
    Sundeeva, Ludmila A.
    Hashim, Wahidah
    Rahim, Robbi
    Maseleno, Andino
    [J]. INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2020, 23 (04) : 799 - 806