Improved Speech Emotion Recognition Using Channel-wise Global Head Pooling (CwGHP)

Cited by: 3
Authors
Chauhan, Krishna [1]
Sharma, Kamalesh Kumar [1]
Varma, Tarun [1]
Affiliations
[1] Malaviya Natl Inst Technol Jaipur, Elect & Commun Engn Dept, Jaipur 302017, Rajasthan, India
Keywords
Speech emotion recognition; Multihead attention; Convolutional neural network; MFCC; Adaptive pooling; SPECTRAL FEATURES; CLASSIFICATION; ATTENTION
DOI
10.1007/s00034-023-02367-6
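Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809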
Abstract
A multihead attention-based convolutional neural network (CNN) architecture, known as channel-wise global head pooling, is proposed to improve the classification accuracy of speech emotion recognition. A time-frequency kernel is used in the two-dimensional convolution to emphasize both the time and frequency scales of the mel-frequency cepstral coefficients (MFCCs). Following the CNN encoder, a multihead attention network is optimized to learn salient discriminating characteristics of audio samples on three emotional speech datasets: the interactive emotional dyadic motion capture (IEMOCAP) dataset in English, the Berlin emotional speech dataset in German, and the Ryerson audio-visual database of emotional speech and song in North American English. The robustness of the proposed model is demonstrated across these datasets in different languages. A chunk-level classification approach is used for model training, with the source label assigned to each segment. During model evaluation, the chunk-level emotion predictions are aggregated to classify each full sample. On the IEMOCAP dataset, the classification accuracy improves to 84.89% unweighted accuracy (UA) and 82.87% weighted accuracy (WA). This is state-of-the-art performance on this speech corpus using only the audio modality, compared with previously reported results of 79.34% WA and 77.54% UA; the proposed method achieves a UA improvement of more than 7%. Furthermore, the model was validated on the two other datasets through a series of experiments that yielded acceptable results. The model is evaluated using WA and UA. Additionally, precision, recall and F1-score are used to assess its effectiveness for each emotion class.
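The evaluation steps named in the abstract (aggregating chunk-level predictions into a sample-level decision, then scoring with WA and UA) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the aggregation rule, so averaging the per-chunk class probabilities is an assumption, and the function names and toy probabilities are hypothetical. WA and UA follow their usual definitions in speech emotion recognition (overall accuracy and mean per-class recall, respectively).

```python
from collections import defaultdict


def aggregate_chunks(chunk_probs):
    """Average per-chunk class probabilities (assumed rule) and return the top class index."""
    n_classes = len(chunk_probs[0])
    mean = [sum(p[c] for p in chunk_probs) / len(chunk_probs) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: recall computed per class, then averaged over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data (hypothetical): four samples, each split into chunks with 3-class probabilities.
samples = [
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.7, 0.2, 0.1]],
    [[0.1, 0.8, 0.1], [0.3, 0.4, 0.3]],
    [[0.2, 0.1, 0.7], [0.5, 0.2, 0.3]],
    [[0.3, 0.6, 0.1], [0.2, 0.5, 0.3]],
]
y_true = [0, 1, 2, 0]
y_pred = [aggregate_chunks(chunks) for chunks in samples]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(y_pred, round(wa, 4), round(ua, 4))
```

In this toy case the last sample is misclassified, so WA and UA differ: WA counts every sample equally, while UA counts every class equally, which matters on class-imbalanced corpora.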
Pages: 5500-5522
Page count: 23
Related papers
50 in total
  • [21] DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement
    Lv, Shubo
    Hu, Yanxin
    Zhang, Shimin
    Xie, Lei
    INTERSPEECH 2021, 2021, : 2816 - 2820
  • [22] IMPROVED SPEECH EMOTION RECOGNITION USING ERROR CORRECTING CODES
    Chakraborty, Rupayan
    Kopparapu, Sunil Kumar
    2016 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2016,
  • [23] Improved Channel-Wise Semantic Alignment for Few-Shot Object Detection
    Xiang, Min
    Qin, Lifeng
    Han, Ruizi
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XI, ICIC 2024, 2024, 14872 : 50 - 61
  • [24] A novel skip connection mechanism based on channel-wise cross transformer for speech enhancement
    Jiang, Weiqi
    Sun, Chengli
    Chen, Feilong
    Leng, Yan
    Guo, Qiaosheng
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (12) : 34849 - 34866
  • [25] A novel skip connection mechanism based on channel-wise cross transformer for speech enhancement
    Weiqi Jiang
    Chengli Sun
    Feilong Chen
    Yan Leng
    Qiaosheng Guo
    Multimedia Tools and Applications, 2024, 83 : 34849 - 34866
  • [26] Convolution neural network with multiple pooling strategies for speech emotion recognition
    Jiang, Pengxu
    Zou, Cairong
    2022 6TH INTERNATIONAL SYMPOSIUM ON COMPUTER SCIENCE AND INTELLIGENT CONTROL, ISCSIC, 2022, : 89 - 92
  • [27] An Attention Pooling based Representation Learning Method for Speech Emotion Recognition
    Li, Pengcheng
    Song, Yan
    McLoughlin, Ian
    Guo, Wu
    Dai, Lirong
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 3087 - 3091
  • [28] Polyphonic sound event localization and detection using channel-wise FusionNet
    Spoorthy, V.
    Koolagudi, Shashidhar G.
    APPLIED INTELLIGENCE, 2024, 54 (06) : 5015 - 5026
  • [29] CarveNet: a channel-wise attention-based network for irregular scene text recognition
    Guibin Wu
    Zheng Zhang
    Yongping Xiong
    International Journal on Document Analysis and Recognition (IJDAR), 2022, 25 : 177 - 186
  • [30] CarveNet: a channel-wise attention-based network for irregular scene text recognition
    Wu, Guibin
    Zhang, Zheng
    Xiong, Yongping
    INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2022, 25 (3) : 177 - 186