Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features

Cited by: 0
Authors
Starlet Ben Alex
Leena Mary
Ben P. Babu
Affiliations
[1] APJ Abdul Kalam Technological University, Centre for Advanced Signal Processing (CASP), Rajiv Gandhi Institute of Technology
[2] Government Engineering College, Department of Electronics and Communication Engineering
Keywords
Automatic emotion recognition (AER); Prosodic features; Syllabification; Attention mechanism; Feature selection; Score-level fusion
Abstract
This work attempts to recognize emotions from human speech using prosodic information represented by variations in duration, energy, and fundamental frequency ($F_{0}$). For this, the speech signal is first automatically segmented into syllables. Prosodic features at the utterance level (15 features) and the syllable level (10 features) are extracted using the syllable boundaries and trained separately with deep neural network classifiers. The effectiveness of the proposed approach is demonstrated on the German speech corpus EmotAsS (EMOTional Sensitivity ASsistance System) for people with disabilities, the dataset used for the Interspeech 2018 Atypical Affect Sub-Challenge. On evaluation, the initial set of prosodic features yields an unweighted average recall (UAR) of 30.15%. Fusing the decision scores of these features with those of spectral features gives a UAR of 36.71%. This paper also employs an attention mechanism and feature selection using resampling-based recursive feature elimination (RFE) to enhance system performance. Applying attention and feature selection followed by score-level fusion improves the UAR to 36.83% for prosodic features and 40.96% for the overall fusion. Fusing the scores of the best individual system of the Atypical Affect Sub-Challenge with those of the proposed system gives a UAR of 43.71%, above the best reported test result. The effectiveness of the proposed system has also been demonstrated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, with a UAR of 63.83%.
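The abstract names three pieces of machinery that short sketches can make concrete. First, attention over syllable-level features: each syllable's prosodic vector receives a relevance weight, and the utterance representation is their weighted sum. The NumPy sketch below is a minimal illustration only; the attention parameters are random stand-ins (the paper learns them jointly with the DNN classifier), and the dimensions (7 syllables, 10 features per syllable) are assumptions matching the abstract's feature counts, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
syllables = rng.normal(size=(7, 10))  # 7 syllables x 10 prosodic features (toy data)
w = rng.normal(size=10)               # attention parameters (random stand-in)

scores = syllables @ w                         # one relevance score per syllable
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
utterance_vec = alpha @ syllables              # attended utterance-level summary
print(alpha.round(3), utterance_vec.shape)
```

Second, feature selection via RFE. The sketch below substitutes scikit-learn's RFECV (RFE with a cross-validated choice of the feature count) as a readily available analogue of the paper's resampling-based RFE; the toy data, the logistic-regression estimator, and the `recall_macro` scorer (which equals UAR) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the 15 utterance-level prosodic features.
X, y = make_classification(n_samples=200, n_features=15, n_informative=6,
                           random_state=0)
# recall_macro is the mean of per-class recalls, i.e. UAR.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                 scoring="recall_macro")
selector.fit(X, y)
print("Selected features:", np.flatnonzero(selector.support_))
```

Finally, score-level fusion combines the class posteriors of the prosodic and spectral classifiers, and performance is reported as UAR. A minimal sketch, assuming a simple weighted-sum fusion rule with an illustrative weight and random posteriors (the paper does not publish its fusion weights in this abstract):

```python
import numpy as np
from sklearn.metrics import recall_score

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    return recall_score(y_true, y_pred, average="macro")

def fuse_scores(p_prosodic, p_spectral, w=0.5):
    """Weighted score-level fusion of two (n_utterances, n_classes) posterior matrices."""
    return w * p_prosodic + (1.0 - w) * p_spectral

rng = np.random.default_rng(0)
y_true = np.array([0, 1, 2, 1, 0, 2])       # toy emotion labels
p_pros = rng.dirichlet(np.ones(3), size=6)  # stand-in prosodic posteriors
p_spec = rng.dirichlet(np.ones(3), size=6)  # stand-in spectral posteriors
y_pred = fuse_scores(p_pros, p_spec, w=0.6).argmax(axis=1)
print(f"UAR: {uar(y_true, y_pred):.3f}")
```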
Pages: 5681-5709 (28 pages)
Related Papers (10 of 50 shown)
  • [1] Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features
    Ben Alex, Starlet
    Mary, Leena
    Babu, Ben P.
    [J]. CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2020, 39 (11) : 5681 - 5709
  • [2] Utterance and Syllable Level Prosodic Features for Automatic Emotion Recognition
    Ben Alex, Starlet
    Babu, Ben P.
    Mary, Leena
    [J]. 2018 IEEE RECENT ADVANCES IN INTELLIGENT COMPUTATIONAL SYSTEMS (RAICS), 2018, : 31 - 35
  • [3] Speech emotion recognition based on syllable-level feature extraction
    Rehman, Abdul
    Liu, Zhen-Tao
    Wu, Min
    Cao, Wei-Hua
    Jiang, Cheng-Shan
    [J]. APPLIED ACOUSTICS, 2023, 211
  • [4] Syllable-level desynchronisation of phonetic features for speech recognition
    Kirchhoff, K
    [J]. ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 2274 - 2276
  • [5] Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation
    Zhang, Guangyan
    Qin, Ying
    Lee, Tan
    [J]. INTERSPEECH 2020, 2020, : 3426 - 3430
  • [6] Prosodic feature normalization for emotion recognition by using synthesized speech
    Suzuki, Motoyuki
    Nakagawa, Shohei
    Kita, Kenji
    [J]. ADVANCES IN KNOWLEDGE-BASED AND INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, 2012, 243 : 306 - 313
  • [7] Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
    Mocanu, Bogdan
    Tapu, Ruxandra
    Zaharia, Titus
    [J]. SENSORS, 2021, 21 (12)
  • [8] Emotion Recognition from Speech using Prosodic and Linguistic Features
    Pervaiz, Mahwish
    Khan, Tamim Ahmed
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2016, 7 (08) : 84 - 90
  • [9] Automatic Emotion Recognition using Auditory and Prosodic Indicative Features
    Gharsellaoui, Soumaya
    Selouani, Sid-Ahmed
    Dahmane, Adel Omar
    [J]. 2015 IEEE 28TH CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING (CCECE), 2015, : 1265 - 1270
  • [10] Acoustic feature selection for automatic emotion recognition from speech
    Rong, Jia
    Li, Gang
    Chen, Yi-Ping Phoebe
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2009, 45 (03) : 315 - 328