Speech Emotion Recognition Using a Dual-Channel Complementary Spectrogram and the CNN-SSAE Neural Network

Cited by: 10
Authors
Li, Juan [1 ,2 ]
Zhang, Xueying [1 ]
Huang, Lixia [1 ]
Li, Fenglian [1 ]
Duan, Shufei [1 ]
Sun, Ying [1 ]
Affiliations
[1] Taiyuan Univ Technol, Coll Informat & Comp, Jinzhong 030600, Peoples R China
[2] Yuncheng Univ, Dept Phys & Elect Engn, Yuncheng 044000, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2022, Vol. 12, Issue 19
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; deep learning; Mel spectrogram; IMel spectrogram; STACKED SPARSE AUTOENCODER; SPECTRAL FEATURES; STRESS RECOGNITION; NEURAL-NETWORK; MODEL; PSO;
DOI
10.3390/app12199518
Chinese Library Classification
O6 [Chemistry];
Discipline Code
0703 ;
Abstract
Featured Application: Emotion recognition is a computer's automatic recognition of the emotional state of input speech. It is an active research field arising from the mutual infiltration and interweaving of phonetics, psychology, digital signal processing, pattern recognition, and artificial intelligence. At present, speech emotion recognition is widely used in intelligent signal processing, smart medical care, business intelligence, lie-detection assistance, criminal investigation, the service industry, self-driving cars, smartphone voice assistants, and human psychoanalysis. Against the background of artificial intelligence, smooth communication between people and machines has become a widely pursued goal. The Mel spectrogram is a common representation in speech emotion recognition, focusing on the low-frequency part of speech. In contrast, the inverse Mel (IMel) spectrogram, which focuses on the high-frequency part, is proposed so that emotions can be analyzed comprehensively. Because the convolutional neural network-stacked sparse autoencoder (CNN-SSAE) can extract deep optimized features, a Mel-IMel dual-channel complementary structure is proposed. In the first channel, a CNN extracts the low-frequency information of the Mel spectrogram; the other channel extracts the high-frequency information of the IMel spectrogram. This information is fed into an SSAE to reduce its dimensionality and obtain optimized features. Experimental results show that the highest recognition rates achieved on the EMO-DB, SAVEE, and RAVDESS datasets were 94.79%, 88.96%, and 83.18%, respectively. The recognition rate with the two spectrograms combined was higher than with either single spectrogram, which proves that the two spectrograms are complementary. The SSAE following the CNN further optimized the features and improved the recognition rate, which proves the effectiveness of the CNN-SSAE network.
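The Mel/IMel contrast described in the abstract can be illustrated with filter-bank center spacing. The sketch below is a minimal NumPy illustration, not the authors' implementation: it assumes the IMel filter bank is obtained by mirroring the Mel-spaced center frequencies about the band edges, so frequency resolution is fine at high frequencies instead of low ones. The function names (`mel_centers`, `imel_centers`, `filter_bank`) are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK Mel scale: compresses high frequencies.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(n_filt=40, sr=16000, fmin=0.0, fmax=None):
    # n_filt + 2 points uniformly spaced on the Mel scale, mapped back
    # to Hz: centers cluster at LOW frequencies.
    fmax = fmax if fmax is not None else sr / 2
    m = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filt + 2)
    return mel_to_hz(m)

def imel_centers(n_filt=40, sr=16000, fmin=0.0, fmax=None):
    # Assumed IMel construction: mirror the Mel-spaced centers about the
    # band edges, so centers cluster at HIGH frequencies instead.
    fmax = fmax if fmax is not None else sr / 2
    return fmin + fmax - mel_centers(n_filt, sr, fmin, fmax)[::-1]

def filter_bank(center_hz, n_fft=512, sr=16000):
    # Triangular filters over FFT bins; center_hz includes the two
    # boundary points, so len(center_hz) - 2 filters are produced.
    bins = np.floor((n_fft + 1) * np.asarray(center_hz) / sr).astype(int)
    n_filt = len(center_hz) - 2
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope (peak = 1 at center)
            fb[i - 1, k] = (right - k) / (right - center)
    return fb
```

Applying either filter bank to a power spectrogram (matrix product `fb @ power_spec`) and taking the log yields the Mel or IMel spectrogram image that would feed the corresponding CNN channel.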
Pages: 20
Related Papers (50 records)
  • [21] 1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features
    Mustaqeem
    Kwon, Soonil
CMC-COMPUTERS MATERIALS & CONTINUA, 2021, 67 (03): 4039 - 4059
  • [22] A Dual-Complementary Acoustic Embedding Network Learned from Raw Waveform for Speech Emotion Recognition
    Huang, Tzu-Yun
    Li, Jeng-Lin
    Chang, Chun-Min
    Lee, Chi-Chun
    2019 8TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2019,
  • [23] The Application of Capsule Neural Network Based CNN for Speech Emotion Recognition
    Wen, Xin-Cheng
    Liu, Kun-Hong
    Zhang, Wei-Ming
    Jiang, Kai
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9356 - 9362
  • [24] Excitation Features of Speech for Emotion Recognition Using Neutral Speech as Reference
    Sudarsana Reddy Kadiri
    P. Gangamohan
    Suryakanth V. Gangashetty
    Paavo Alku
    B. Yegnanarayana
    Circuits, Systems, and Signal Processing, 2020, 39 : 4459 - 4481
  • [25] Excitation Features of Speech for Emotion Recognition Using Neutral Speech as Reference
Kadiri, Sudarsana Reddy
    Gangamohan, P.
Gangashetty, Suryakanth V.
    Alku, Paavo
    Yegnanarayana, B.
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2020, 39 (09) : 4459 - 4481
  • [26] Learning Salient Features for Speech Emotion Recognition Using CNN
    Liu, Jiamu
    Han, Wenjing
    Ruan, Huabin
    Chen, Xiaomin
    Jiang, Dongmei
    Li, Haifeng
    2018 FIRST ASIAN CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII ASIA), 2018,
  • [27] Speech Emotion Recognition using XGBoost and CNN BLSTM with Attention
    He, Jingru
    Ren, Liyong
    2021 IEEE SMARTWORLD, UBIQUITOUS INTELLIGENCE & COMPUTING, ADVANCED & TRUSTED COMPUTING, SCALABLE COMPUTING & COMMUNICATIONS, INTERNET OF PEOPLE, AND SMART CITY INNOVATIONS (SMARTWORLD/SCALCOM/UIC/ATC/IOP/SCI 2021), 2021, : 154 - 159
  • [28] Comparative Analysis of Windows for Speech Emotion Recognition Using CNN
    Teixeira, Felipe L.
    Soares, Salviano Pinto
    Abreu, J. L. Pio
    Oliveira, Paulo M.
    Teixeira, Joao P.
    OPTIMIZATION, LEARNING ALGORITHMS AND APPLICATIONS, PT I, OL2A 2023, 2024, 1981 : 233 - 248
  • [29] LPI Radar Signal Recognition Based on Dual-Channel CNN and Feature Fusion
    Quan, Daying
    Tang, Zeyu
    Wang, Xiaofeng
    Zhai, Wenchao
    Qu, Chongxiao
    SYMMETRY-BASEL, 2022, 14 (03):
  • [30] Dual-channel spectral weighting for robust speech recognition in mobile devices
    Lopez-Espejo, Ivan
    Peinado, Antonio M.
    Gomez, Angel M.
    Gonzalez, Jose A.
    DIGITAL SIGNAL PROCESSING, 2018, 75 : 13 - 24