Improved Speech Emotion Recognition Using Channel-wise Global Head Pooling (CwGHP)

Cited by: 3

Authors:

Chauhan, Krishna [1]

Sharma, Kamalesh Kumar [1]

Varma, Tarun [1]

Affiliations:

[1] Malaviya Natl Inst Technol Jaipur, Elect & Commun Engn Dept, Jaipur 302017, Rajasthan, India

Source:

CIRCUITS SYSTEMS AND SIGNAL PROCESSING | 2023 / Vol. 42 / No. 09

Keywords:

Speech emotion recognition; Multihead attention; Convolutional neural network; MFCC; Adaptive pooling; SPECTRAL FEATURES; CLASSIFICATION; ATTENTION;

DOI:

10.1007/s00034-023-02367-6

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

A multihead attention-based convolutional neural network (CNN) architecture, called channel-wise global head pooling (CwGHP), is proposed to improve the classification accuracy of speech emotion recognition. A time-frequency kernel is used in the two-dimensional convolution to emphasize both the time and frequency scales of the mel-frequency cepstral coefficients (MFCCs). Following the CNN encoder, a multihead attention network is optimized to learn salient discriminative characteristics of audio samples on three emotional speech datasets: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database in English, the Berlin emotional speech dataset in German, and the Ryerson Audio-Visual Database of Emotional Speech and Song in North American English. The robustness of the proposed model is demonstrated across these datasets in different languages. A chunk-level classification approach is used for model training, with source labels assigned to each segment. During model evaluation, the segment-level emotions are aggregated to classify each emotional sample. On the IEMOCAP dataset, the classification accuracy is improved to 84.89% unweighted accuracy (UA) and 82.87% weighted accuracy (WA). This is state-of-the-art performance on this speech corpus using only the audio modality, compared with 79.34% WA and 77.54% UA; the proposed method achieves a UA improvement of more than 7%. Furthermore, the model is validated on the two other datasets via a series of experiments that yield acceptable results. The model is evaluated using WA and UA. Additionally, precision, recall and F1-score are used to assess its effectiveness on each emotion class.
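The evaluation procedure described in the abstract (chunk-level predictions aggregated per sample, then scored with WA and UA) can be illustrated with a minimal sketch. The abstract does not state the aggregation rule, so majority voting is an assumption here, and the utterance IDs, labels, and function names are hypothetical. WA and UA follow the usual convention in speech emotion recognition: WA is overall accuracy, and UA is the mean of per-class recalls.

```python
from collections import Counter, defaultdict


def aggregate_chunks(chunk_preds):
    """Majority vote over the chunk-level predictions of one utterance.

    Assumption: the paper's exact aggregation rule is not given in the
    abstract; majority voting is used here purely for illustration.
    Ties go to the label that appears first in the list.
    """
    return Counter(chunk_preds).most_common(1)[0][0]


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of utterances classified correctly (overall accuracy)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so each class counts equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Hypothetical chunk-level predictions for five utterances.
chunk_predictions = {
    "u1": ["angry", "angry", "neutral"],
    "u2": ["sad", "neutral", "neutral"],
    "u3": ["happy", "happy"],
    "u4": ["neutral", "neutral", "sad"],
    "u5": ["neutral"],
}
true_labels = {
    "u1": "angry", "u2": "sad", "u3": "happy",
    "u4": "neutral", "u5": "neutral",
}

ids = sorted(chunk_predictions)
preds = [aggregate_chunks(chunk_predictions[i]) for i in ids]
truth = [true_labels[i] for i in ids]

wa = weighted_accuracy(truth, preds)
ua = unweighted_accuracy(truth, preds)
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

In this toy data the majority class ("neutral") is always recognized while "sad" is missed, so WA is higher than UA. This is why both metrics are reported on class-imbalanced corpora.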


Pages: 5500 - 5522

Page count: 23

50 items in total

[41] Speech Emotion Recognition using DWT
Lalitha, S.
Mudupu, Anoop
Nandyala, Bala Visali
Munagala, Renuka
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH (ICCIC), 2015, : 20 - 23
[42] A Review on Emotion Recognition using Speech
Basu, Saikat
Chakraborty, Jaybrata
Bag, Arnab
Aftabuddin, Md.
PROCEEDINGS OF THE 2017 INTERNATIONAL CONFERENCE ON INVENTIVE COMMUNICATION AND COMPUTATIONAL TECHNOLOGIES (ICICCT), 2017, : 109 - 114
[43] Speech Emotion Recognition Using CNN
Huang, Zhengwei
Dong, Ming
Mao, Qirong
Zhan, Yongzhao
PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014, : 801 - 804
[44] Recognition of practical speech emotion using improved shuffled frog leaping algorithm
Zhang, Xiaodan
Huang, Chengwei
Zhao, Li
Zou, Cairong
Chinese Journal of Acoustics, 2014, 33 (04) : 441 - 456
[45] Recognition of practical speech emotion using improved shuffled frog leaping algorithm
Zhang, X., 1600, Science Press (39):
[46] Accumulating global channel-wise patterns via deformed-bottleneck recalibration for image classification
Nguyen, Thanh Tuan
Nguyen, Thanh Phuong
Nguyen, Vincent
PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (02)
[47] Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks
Li, Xu
Wu, Xixin
Lu, Hui
Liu, Xunying
Meng, Helen
INTERSPEECH 2021, 2021, : 4314 - 4318
[48] Multi-branch Channel-wise Enhancement Network for Fine-grained Visual Recognition
Li, Guangjun
Wang, Yongxiong
Zhu, Fengting
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5273 - 5280
[49] Channel-Wise Activation Map Pruning using MaxPool for Reducing Memory Accesses
Cho, Han
Park, Jongsun
2022 19TH INTERNATIONAL SOC DESIGN CONFERENCE (ISOCC), 2022, : 71 - 72
[50] RNN with Improved Temporal Modeling for Speech Emotion Recognition
Lieskovska, Eva
Jakubec, Maros
Jarina, Roman
2022 32ND INTERNATIONAL CONFERENCE RADIOELEKTRONIKA (RADIOELEKTRONIKA), 2022, : 5 - 9
