Improved Speech Emotion Recognition Using Channel-wise Global Head Pooling (CwGHP)

Cited by: 3
Authors
Chauhan, Krishna [1]
Sharma, Kamalesh Kumar [1]
Varma, Tarun [1]
Affiliations
[1] Malaviya National Institute of Technology Jaipur, Electronics and Communication Engineering Department, Jaipur 302017, Rajasthan, India
Keywords
Speech emotion recognition; Multihead attention; Convolutional neural network; MFCC; Adaptive pooling; Spectral features; Classification; Attention
DOI
10.1007/s00034-023-02367-6
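Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic and communication technology]
Discipline classification codes
0808; 0809
Abstract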
A multihead attention-based convolutional neural network (CNN) architecture, channel-wise global head pooling (CwGHP), is proposed to improve the classification accuracy of speech emotion recognition. A time-frequency kernel is used in the two-dimensional convolution to emphasize both the time and frequency scales of the mel-frequency cepstral coefficients (MFCCs). Following the CNN encoder, a multihead attention network is optimized to learn the salient discriminating characteristics of audio samples on three emotional speech datasets: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database in English, the Berlin emotional speech dataset (EMO-DB) in German, and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) in North American English. The robustness of the proposed model is demonstrated across these datasets in different languages. The model is trained with a chunk-level classification approach in which each segment carries the label of its source utterance. During evaluation, the chunk-level emotion predictions are aggregated to classify the whole sample. On the IEMOCAP dataset, the model reaches 84.89% unweighted accuracy (UA) and 82.87% weighted accuracy (WA). This is state-of-the-art performance on this corpus using only the audio modality, compared with the previous best of 79.34% WA and 77.54% UA, a UA improvement of more than 7%. The model is further validated on the two other datasets through a series of experiments that yielded acceptable results. The model is evaluated using WA and UA. In addition, precision, recall, and F1-score are reported to assess performance on each emotion class.
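The evaluation procedure described in the abstract (chunk-level predictions aggregated per sample, then scored with WA and UA) can be illustrated with a minimal Python sketch. The abstract does not state which aggregation rule the authors use, so averaging the per-chunk class probabilities is an assumption made here for illustration. The function names and the toy data are hypothetical and are not taken from the paper. WA is computed as overall accuracy and UA as the mean of per-class recalls, which is the usual convention in speech emotion recognition.

```python
import numpy as np

def aggregate_chunks(chunk_probs):
    """Return the utterance-level class by averaging chunk probabilities.

    Assumption: mean-probability aggregation; the paper may use another rule.
    chunk_probs has shape (n_chunks, n_classes).
    """
    chunk_probs = np.asarray(chunk_probs, dtype=float)
    return int(np.argmax(chunk_probs.mean(axis=0)))

def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all samples classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so each class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy data (hypothetical): five utterances, two classes, variable chunk counts.
utterances = [
    [[0.7, 0.3], [0.6, 0.4]],
    [[0.8, 0.2], [0.4, 0.6]],
    [[0.9, 0.1]],
    [[0.2, 0.8], [0.3, 0.7]],
    [[0.6, 0.4], [0.7, 0.3]],  # true class 1, aggregated prediction is 0
]
y_true = [0, 0, 0, 1, 1]
y_pred = [aggregate_chunks(u) for u in utterances]

print("WA:", weighted_accuracy(y_true, y_pred))
print("UA:", unweighted_accuracy(y_true, y_pred))
```

On imbalanced data the two metrics diverge: in the toy example the minority class is recognized only half the time, which lowers UA more than WA. This is why both figures are usually reported together.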
Pages: 5500-5522
Page count: 23