Improved Speech Emotion Recognition Using Channel-wise Global Head Pooling (CwGHP)

Cited by: 3

Authors:

Chauhan, Krishna [1]

Sharma, Kamalesh Kumar [1]

Varma, Tarun [1]

Affiliations:

[1] Malaviya Natl Inst Technol Jaipur, Elect & Commun Engn Dept, Jaipur 302017, Rajasthan, India

Source:

CIRCUITS SYSTEMS AND SIGNAL PROCESSING | 2023 / Vol. 42 / No. 09

Keywords:

Speech emotion recognition; Multihead attention; Convolutional neural network; MFCC; Adaptive pooling; SPECTRAL FEATURES; CLASSIFICATION; ATTENTION;

DOI:

10.1007/s00034-023-02367-6

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

A multihead attention-based convolutional neural network (CNN) architecture, known as channel-wise global head pooling (CwGHP), is proposed to improve the classification accuracy of speech emotion recognition. A time-frequency kernel is used in the two-dimensional convolution to emphasize both the time and frequency scales of the mel-frequency cepstral coefficients (MFCCs). Following the CNN encoder, a multihead attention network is optimized to learn salient discriminating characteristics of audio samples on three emotional speech datasets: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset in English, the Berlin emotional speech dataset in German, and the Ryerson Audio-Visual Database of Emotional Speech and Song in North American English. The model's robustness is demonstrated across these datasets in different languages. A chunk-level classification approach is used for model training, with each segment assigned its source label. During model evaluation, the chunk-level emotion predictions are aggregated to classify each full sample. On the IEMOCAP dataset, the classification accuracy is improved to 84.89% unweighted accuracy (UA) and 82.87% weighted accuracy (WA). This is state-of-the-art performance on this speech corpus using only the audio modality; compared to the reported 79.34% WA and 77.54% UA, the proposed method achieves a UA improvement of more than 7%. Furthermore, the model is validated on the two other datasets through a series of experiments that yielded acceptable results. The model is evaluated using WA and UA; additionally, precision, recall and F1-score are used to assess its effectiveness on each emotion class.
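The chunk-level evaluation and the two accuracy measures mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how chunk predictions are aggregated, so averaging per-chunk class probabilities is an assumption here, and the function names (`aggregate_chunks`, `weighted_accuracy`, `unweighted_accuracy`) are hypothetical. WA and UA follow the usual convention in speech emotion recognition: WA is the overall accuracy, and UA is the mean of the per-class recalls.

```python
from collections import defaultdict


def aggregate_chunks(chunk_probs, utterance_ids):
    """Average per-chunk class probabilities for each utterance.

    Returns {utterance_id: predicted_class_index}. Averaging is an
    assumption; the abstract only says chunk emotions are aggregated.
    """
    sums = {}
    counts = defaultdict(int)
    for probs, utt in zip(chunk_probs, utterance_ids):
        if utt not in sums:
            sums[utt] = [0.0] * len(probs)
        sums[utt] = [s + p for s, p in zip(sums[utt], probs)]
        counts[utt] += 1
    preds = {}
    for utt, total in sums.items():
        mean = [s / counts[utt] for s in total]
        # Index of the class with the highest mean probability.
        preds[utt] = max(range(len(mean)), key=mean.__getitem__)
    return preds


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all samples classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data: three chunks from utterance "a", two from utterance "b".
chunk_probs = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1], [0.7, 0.3]]
utterance_ids = ["a", "a", "a", "b", "b"]
utterance_preds = aggregate_chunks(chunk_probs, utterance_ids)

# Imbalanced toy labels: a model that always predicts class 0.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

On the imbalanced toy labels, WA is 0.75 while UA is 0.5, which shows why both figures are reported for corpora with uneven emotion classes.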


Pages: 5500 - 5522

Page count: 23
