Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files

Cited by: 50
Authors
Andayani, Felicia [1 ]
Theng, Lau Bee [1 ]
Tsun, Mark Teekit [1 ]
Chua, Caslon [2 ]
Affiliations
[1] Swinburne Univ Technol, Fac Engn Comp & Sci, Sarawak Campus, Sarawak 93350, Malaysia
[2] Swinburne Univ Technol, Fac Sci Engn & Technol, Melbourne, Vic 3122, Australia
Source
IEEE ACCESS | 2022, Vol. 10
Keywords
Feature extraction; Speech recognition; Transformers; Emotion recognition; Task analysis; Convolutional neural networks; Spectrogram; Attention mechanism; long short-term memory network; speech emotion recognition; transformer encoder;
DOI
10.1109/ACCESS.2022.3163856
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Emotion is a vital component of daily human communication, helping people understand one another, and emotion recognition plays a crucial role in developing human-computer interaction. In a nutshell, Speech Emotion Recognition (SER) identifies the emotional signals transmitted through human speech or daily conversation, where the emotions in an utterance depend strongly on temporal information. Although much existing research has shown that hybrid systems outperform the traditional single classifiers used in SER, each approach still has limitations. This paper therefore proposes a hybrid of a Long Short-Term Memory (LSTM) network and a Transformer encoder to learn the long-term dependencies in speech signals and classify emotions. Speech features are extracted as Mel Frequency Cepstral Coefficients (MFCC) and fed into the proposed hybrid LSTM-Transformer classifier. A range of performance evaluations was conducted on the proposed model; the results indicate a significant recognition improvement over existing models reported in other published works. The proposed hybrid model reached 75.62%, 85.55%, and 72.49% recognition accuracy on the RAVDESS, Emo-DB, and language-independent datasets, respectively.
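The abstract outlines a pipeline: MFCC features are extracted from the audio and passed through an LSTM, and a Transformer encoder then attends over the LSTM's frame-level outputs before classification. A minimal PyTorch sketch of that pipeline follows; the layer sizes, head count, bidirectional LSTM, mean pooling, and eight-class output (RAVDESS uses eight emotion labels) are illustrative assumptions rather than the configuration reported in the paper.

import torch
import torch.nn as nn

class LSTMTransformerSER(nn.Module):
    """Hedged sketch of an MFCC -> LSTM -> Transformer encoder classifier."""
    def __init__(self, n_mfcc=40, hidden=128, n_heads=4, n_layers=2, n_emotions=8):
        super().__init__()
        # The LSTM captures local temporal dependencies in the MFCC sequence.
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        # The Transformer encoder attends over the LSTM outputs to model
        # long-range dependencies across the whole utterance.
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, mfcc):            # mfcc: (batch, frames, n_mfcc)
        seq, _ = self.lstm(mfcc)        # (batch, frames, 2*hidden)
        seq = self.encoder(seq)         # (batch, frames, 2*hidden)
        pooled = seq.mean(dim=1)        # average over time frames (assumed pooling)
        return self.classifier(pooled)  # (batch, n_emotions) emotion logits

# Example: 94 MFCC frames, roughly a 3-second clip at 16 kHz with a 512-sample hop,
# e.g. extracted with librosa.feature.mfcc(y=signal, sr=16000, n_mfcc=40).T
logits = LSTMTransformerSER()(torch.randn(1, 94, 40))
print(logits.shape)  # torch.Size([1, 8])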
Pages: 36018-36027
Page count: 10
Related Papers
50 records in total
  • [21] HYBRID SPEECH RECOGNITION WITH DEEP BIDIRECTIONAL LSTM
    Graves, Alex
    Jaitly, Navdeep
    Mohamed, Abdel-Rahman
    2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2013: 273-278
  • [22] Speech-Based Techniques for Emotion Detection in Natural Arabic Audio Files
    Kaloub, Ashraf
    Elgabar, Eltyeb Abed
    INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2025, 22 (01): 139-157
  • [23] Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database
    Yu, Yeonguk
    Kim, Yoon-Joong
    ELECTRONICS, 2020, 9 (05)
  • [24] An enhanced speech emotion recognition using vision transformer
    Akinpelu, Samson
    Viriri, Serestina
    Adegun, Adekanmi
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [25] Multimodal transformer augmented fusion for speech emotion recognition
    Wang, Yuanyuan
    Gu, Yu
    Yin, Yifei
    Han, Yingping
    Zhang, He
    Wang, Shuang
    Li, Chenyu
    Quan, Dou
    FRONTIERS IN NEUROROBOTICS, 2023, 17
  • [26] GCFormer: A Graph Convolutional Transformer for Speech Emotion Recognition
    Gao, Yingxue
    Zhao, Huan
    Xiao, Yufeng
    Zhang, Zixing
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023: 307-313
  • [27] Time series prediction model using LSTM-Transformer neural network for mine water inflow
    Shi, Junwei
    Wang, Shiqi
    Qu, Pengfei
    Shao, Jianli
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [28] Speech Emotion Recognition using MFCC features and LSTM network
    Kumbhar, Harshawardhan S.
    Bhandari, Sheetal U.
    2019 5TH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION, CONTROL AND AUTOMATION (ICCUBEA), 2019
  • [29] Attention-Based Dense LSTM for Speech Emotion Recognition
    Xie, Yue
    Liang, Ruiyu
    Liang, Zhenlin
    Zhao, Li
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2019, E102D (07): 1426-1429
  • [30] SPEECH EMOTION RECOGNITION WITH DUAL-SEQUENCE LSTM ARCHITECTURE
    Wang, Jianyou
    Xue, Michael
    Culhane, Ryan
    Diao, Enmao
    Ding, Jie
    Tarokh, Vahid
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020: 6474-6478