Recognizing Semi-Natural and Spontaneous Speech Emotions Using Deep Neural Networks

Cited by: 6
Authors
Amjad, Ammar [1 ]
Khan, Lal [1 ]
Ashraf, Noman [2 ]
Mahmood, Muhammad Bilal [3 ]
Chang, Hsien-Tsung [1 ,4 ,5 ,6 ]
Affiliations
[1] Chang Gung Univ, Coll Engn, Dept Comp Sci & Informat Engn, Taoyuan 33302, Taiwan
[2] Inst Politecn Nacl, CIC, Mexico City 7738, DF, Mexico
[3] Dalian Univ Technol, Dept Software Engn, Dalian 116024, Peoples R China
[4] Chang Gung Univ, Artificial Intelligence Res Ctr, Taoyuan 33302, Taiwan
[5] Chang Gung Mem Hosp, Dept Phys Med & Rehabil, Taoyuan 333, Taiwan
[6] Chang Gung Univ, Bachelor Program Artificial Intelligence, Taoyuan 33302, Taiwan
Source
IEEE ACCESS | 2022, Vol. 10
Keywords
Feature extraction; Convolutional neural networks; Spectrogram; Databases; Emotion recognition; Speech recognition; Data models; Speech emotion recognition; convolutional neural network; data augmentation; long-short-term memory; spontaneous speech database; RECOGNITION FEATURES; SENTIMENT ANALYSIS; CLASSIFICATION;
DOI
10.1109/ACCESS.2022.3163712
CLC Number
TP [Automation technology, computer technology];
Subject Classification Code
0812
Abstract
Identifying emotions in spontaneous speech is a novel and challenging research problem that requires learning deep emotional features from audio signals. This study introduces a technique for recognizing semi-natural and spontaneous speech emotions based on 1D (Model A) and 2D (Model B) deep convolutional neural networks (DCNNs) combined with two layers of long short-term memory (LSTM). Both models learn local and global features from raw speech data and from augmented (mid, left, right, and side) segment-level Mel spectrograms, using several CNN models to learn deep segment-level auditory representations. The architecture of both models consists of five local feature learning blocks (LFLBs), two LSTM layers, and a fully connected layer (FCL). Each LFLB comprises two convolutional layers and a max-pooling layer, which learn local correlations and extract hierarchical features; the LSTM layers then learn long-term correlations from these local features. Experiments show that the proposed systems outperform conventional methods. Model A achieved an average identification accuracy of 94.78% in speaker-dependent (SD) experiments on the raw SAVEE dataset and 73.15% in SD experiments on raw IEMOCAP audio. With augmented Mel spectrograms, Model A obtained SD identification accuracies of 97.19%, 94.09%, and 53.98% on the SAVEE, IEMOCAP, and BAUM-1s databases, respectively. In contrast, Model B achieved identification accuracies of 96.85%, 88.80%, and 48.67% on the SAVEE, IEMOCAP, and BAUM-1s databases, respectively, in speaker-independent (SI) experiments with augmented Mel spectrograms.
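The abstract describes the (mid, left, right, side) augmentation only at a high level. A common reading is that the two channels of a stereo recording are combined into mid and side signals before Mel spectrogram extraction; the sketch below illustrates that interpretation in Python with librosa. The function name melspec_views and all parameter values (sampling rate, number of Mel bands) are illustrative assumptions, not the authors' code, and the segment-level slicing step is omitted.

import numpy as np
import librosa

def melspec_views(path, sr=16000, n_mels=128):
    # Hypothetical helper: builds the four spectrogram "views"
    # (left, right, mid, side) assumed from the paper's augmentation.
    y, _ = librosa.load(path, sr=sr, mono=False)  # stereo -> shape (2, n_samples)
    left, right = y[0], y[1]
    mid = 0.5 * (left + right)    # mid channel: average of L and R
    side = 0.5 * (left - right)   # side channel: difference of L and R
    views = {}
    for name, sig in [("left", left), ("right", right), ("mid", mid), ("side", side)]:
        m = librosa.feature.melspectrogram(y=sig, sr=sr, n_mels=n_mels)
        views[name] = librosa.power_to_db(m, ref=np.max)  # log-Mel in dB
    return views

Each view would then be cut into fixed-length segments and fed to the networks, which is what makes the learned representations segment-level.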
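Likewise, the stated architecture (five LFLBs of two convolutional layers plus max pooling each, two LSTM layers, one FCL) can be turned into a minimal PyTorch sketch of the 2D variant, Model B. Channel widths, kernel sizes, activations, and the number of emotion classes below are assumptions chosen only to make the sketch runnable; the paper itself should be consulted for the exact hyperparameters.

import torch
import torch.nn as nn

class LFLB(nn.Module):
    # Local feature learning block: two conv layers + max pooling,
    # per the abstract; batch norm and ELU are assumptions.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out), nn.ELU(),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out), nn.ELU(),
            nn.MaxPool2d(2),
        )
    def forward(self, x):
        return self.block(x)

class ModelB(nn.Module):
    # 2D-DCNN + LSTM sketch: five LFLBs -> two LSTM layers -> FCL.
    def __init__(self, n_mels=128, n_classes=7, hidden=256):
        super().__init__()
        chans = [1, 16, 32, 64, 128, 128]           # channel widths are assumptions
        self.lflbs = nn.Sequential(*[LFLB(chans[i], chans[i + 1]) for i in range(5)])
        freq_out = n_mels // 2 ** 5                 # five poolings halve the Mel axis
        self.lstm = nn.LSTM(input_size=chans[-1] * freq_out, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)
    def forward(self, x):                           # x: (batch, 1, n_mels, time)
        f = self.lflbs(x)                           # -> (batch, C, F', T')
        f = f.permute(0, 3, 1, 2).flatten(2)        # -> (batch, T', C*F') for the LSTM
        out, _ = self.lstm(f)
        return self.fc(out[:, -1])                  # last time step -> emotion logits

logits = ModelB()(torch.randn(2, 1, 128, 256))      # e.g. two spectrogram segments

Flattening the CNN feature maps along the frequency axis and treating the remaining axis as time is what lets the LSTM layers model long-term correlations over the local features the LFLBs extract; Model A, the 1D variant, would follow the same arrangement with Conv1d/MaxPool1d blocks.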
Pages: 37149-37163
Number of pages: 15
Related Papers (50 in total)
  • [1] Semi-Natural and Spontaneous Speech Recognition Using Deep Neural Networks with Hybrid Features Unification
    Amjad, Ammar
    Khan, Lal
    Chang, Hsien-Tsung
    PROCESSES, 2021, 9 (12)
  • [2] Random Deep Belief Networks for Recognizing Emotions from Speech Signals
    Wen, Guihua
    Li, Huihui
    Huang, Jubing
    Li, Danyang
    Xun, Eryang
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2017, 2017
  • [3] Building Emotional Machines: Recognizing Image Emotions Through Deep Neural Networks
    Kim, Hye-Rin
    Kim, Yeong-Seok
    Kim, Seon Joo
    Lee, In-Kwon
    IEEE TRANSACTIONS ON MULTIMEDIA, 2018, 20 (11) : 2980 - 2992
  • [4] Semi-Supervised Learning for Spanish Speech Recognition Using Deep Neural Networks
    Rosario Campomanes-Alvarez, Blanca
    Quiros, Pelayo
    Fernandez, Bernardo
    APPLICATIONS OF INTELLIGENT SYSTEMS, 2018, 310 : 19 - 29
  • [5] Speech watermarking using Deep Neural Networks
    Pavlovic, Kosta
    Kovacevic, Slavko
    Durovic, Igor
    2020 28TH TELECOMMUNICATIONS FORUM (TELFOR), 2020, : 292 - 295
  • [6] Comparison of Sensibilities of Japanese and Koreans in Recognizing Emotions from Speech by using Bayesian Networks
    Cho, Jangsik
    Kato, Shohei
    Itoh, Hidenori
    2009 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC 2009), VOLS 1-9, 2009, : 2866 - 2871
  • [7] Recognizing emotions from speech using a physical model
    Kitaoka, Norihide
    Segawa, Shuhei
    Nishimura, Ryota
    Takeda, Kazuya
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2018, 39 (02) : 167 - 170
  • [8] Recognition of Emotions in Speech Using Convolutional Neural Networks on Different Datasets
    Zielonka, Marta
    Piastowski, Artur
    Czyzewski, Andrzej
    Nadachowski, Pawel
    Operlejn, Maksymilian
    Kaczor, Kamil
    ELECTRONICS, 2022, 11 (22)
  • [9] Speech Activity Detection Using Deep Neural Networks
    Shahsavari, Sajad
    Sameti, Hossein
    Hadian, Hossein
    2017 25TH IRANIAN CONFERENCE ON ELECTRICAL ENGINEERING (ICEE), 2017, : 1564 - 1568