Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features

Cited by: 0
Authors
Starlet Ben Alex
Leena Mary
Ben P. Babu
Affiliations
[1] APJ Abdul Kalam Technological University, Centre for Advanced Signal Processing (CASP), Rajiv Gandhi Institute of Technology
[2] Government Engineering College, Department of Electronics and Communication Engineering
Keywords
Automatic emotion recognition (AER); Prosodic features; Syllabification; Attention mechanism; Feature selection; Score-level fusion
Abstract
This work attempts to recognize emotions from human speech using prosodic information represented by variations in duration, energy, and fundamental frequency ($F_{0}$). For this, the speech signal is first automatically segmented into syllables. Prosodic features at the utterance level (15 features) and the syllable level (10 features) are extracted using the syllable boundaries and trained separately with deep neural network classifiers. The effectiveness of the proposed approach is demonstrated on the German speech corpus EmotAsS (EMOTional Sensitivity ASsistance System) for people with disabilities, the dataset used for the Interspeech 2018 Atypical Affect Sub-Challenge. On evaluation, the initial set of prosodic features yields an unweighted average recall (UAR) of 30.15%. Fusing the decision scores of these features with those of spectral features gives a UAR of 36.71%. This paper also employs an attention mechanism and feature selection using resampling-based recursive feature elimination (RFE) to enhance system performance. Applying attention and feature selection followed by score-level fusion improves the UAR to 36.83% for prosodic features and 40.96% for the overall fusion. Fusing the scores of the best individual system of the Atypical Affect Sub-Challenge with those of the proposed system gives a UAR of 43.71%, above the best reported test result. The effectiveness of the proposed system has also been demonstrated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, with a UAR of 63.83%.
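The abstract names three pieces of machinery that short sketches can make concrete. First, attention over syllable-level features: each syllable's prosodic vector receives a relevance weight, and the utterance representation is their weighted sum. The NumPy sketch below is a minimal illustration only; the attention parameters are random stand-ins (the paper learns them jointly with the DNN classifier), and the dimensions (7 syllables, 10 features per syllable) are assumptions matching the abstract's feature counts, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
syllables = rng.normal(size=(7, 10))  # 7 syllables x 10 prosodic features (toy data)
w = rng.normal(size=10)               # attention parameters (random stand-in)

scores = syllables @ w                         # one relevance score per syllable
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
utterance_vec = alpha @ syllables              # attended utterance-level summary
print(alpha.round(3), utterance_vec.shape)
```

Second, feature selection via RFE. The sketch below substitutes scikit-learn's RFECV (RFE with a cross-validated choice of the feature count) as a readily available analogue of the paper's resampling-based RFE; the toy data, the logistic-regression estimator, and the `recall_macro` scorer (which equals UAR) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the 15 utterance-level prosodic features.
X, y = make_classification(n_samples=200, n_features=15, n_informative=6,
                           random_state=0)
# recall_macro is the mean of per-class recalls, i.e. UAR.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                 scoring="recall_macro")
selector.fit(X, y)
print("Selected features:", np.flatnonzero(selector.support_))
```

Finally, score-level fusion combines the class posteriors of the prosodic and spectral classifiers, and performance is reported as UAR. A minimal sketch, assuming a simple weighted-sum fusion rule with an illustrative weight and random posteriors (the paper does not publish its fusion weights in this abstract):

```python
import numpy as np
from sklearn.metrics import recall_score

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    return recall_score(y_true, y_pred, average="macro")

def fuse_scores(p_prosodic, p_spectral, w=0.5):
    """Weighted score-level fusion of two (n_utterances, n_classes) posterior matrices."""
    return w * p_prosodic + (1.0 - w) * p_spectral

rng = np.random.default_rng(0)
y_true = np.array([0, 1, 2, 1, 0, 2])       # toy emotion labels
p_pros = rng.dirichlet(np.ones(3), size=6)  # stand-in prosodic posteriors
p_spec = rng.dirichlet(np.ones(3), size=6)  # stand-in spectral posteriors
y_pred = fuse_scores(p_pros, p_spec, w=0.6).argmax(axis=1)
print(f"UAR: {uar(y_true, y_pred):.3f}")
```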
Pages: 5681-5709 (28 pages)
Related Papers (10 of 50 shown)
  • [1] Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features
    Ben Alex, Starlet
    Mary, Leena
    Babu, Ben P.
    [J]. CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2020, 39 (11) : 5681 - 5709
  • [2] Utterance and Syllable Level Prosodic Features for Automatic Emotion Recognition
    Ben Alex, Starlet
    Babu, Ben P.
    Mary, Leena
    [J]. 2018 IEEE RECENT ADVANCES IN INTELLIGENT COMPUTATIONAL SYSTEMS (RAICS), 2018, : 31 - 35
  • [3] Speech emotion recognition based on syllable-level feature extraction
    Rehman, Abdul
    Liu, Zhen-Tao
    Wu, Min
    Cao, Wei-Hua
    Jiang, Cheng-Shan
    [J]. APPLIED ACOUSTICS, 2023, 211
  • [4] Syllable-level desynchronisation of phonetic features for speech recognition
    Kirchhoff, K
    [J]. ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 2274 - 2276
  • [5] Learning Syllable-Level Discrete Prosodic Representation for Expressive Speech Generation
    Zhang, Guangyan
    Qin, Ying
    Lee, Tan
    [J]. INTERSPEECH 2020, 2020, : 3426 - 3430
  • [6] Prosodic feature normalization for emotion recognition by using synthesized speech
    Suzuki, Motoyuki
    Nakagawa, Shohei
    Kita, Kenji
    [J]. ADVANCES IN KNOWLEDGE-BASED AND INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, 2012, 243 : 306 - 313
  • [7] Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
    Mocanu, Bogdan
    Tapu, Ruxandra
    Zaharia, Titus
    [J]. SENSORS, 2021, 21 (12)
  • [8] Emotion Recognition from Speech using Prosodic and Linguistic Features
    Pervaiz, Mahwish
    Khan, Tamim Ahmed
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2016, 7 (08) : 84 - 90
  • [9] Automatic Emotion Recognition using Auditory and Prosodic Indicative Features
    Gharsellaoui, Soumaya
    Selouani, Sid-Ahmed
    Dahmane, Adel Omar
    [J]. 2015 IEEE 28TH CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING (CCECE), 2015, : 1265 - 1270
  • [10] Acoustic feature selection for automatic emotion recognition from speech
    Rong, Jia
    Li, Gang
    Chen, Yi-Ping Phoebe
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2009, 45 (03) : 315 - 328