Speech Emotion Recognition Using Deep Learning Transfer Models and Explainable Techniques

Cited by: 2
Authors
Kim, Tae-Wan [1 ]
Kwak, Keun-Chang [1 ]
Affiliations
[1] Chosun Univ, Dept Elect Engn, Interdisciplinary Program IT Bio Convergence Syst, Gwangju 61452, South Korea
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 4
Keywords
speech emotion recognition; explainable model; deep learning; YAMNet; VGGish; audible feature;
DOI
10.3390/app14041553
Chinese Library Classification
O6 [Chemistry]
Discipline Code
0703
Abstract
This study aims to establish greater reliability than conventional speech emotion recognition (SER) studies. This is achieved through preprocessing techniques that reduce uncertain elements, a model that combines the structural features of its constituent networks, and the application of several explanatory techniques. Interpretability can be made more accurate by reducing uncertain training data, applying data from different environments, and applying techniques that explain the reasoning behind the results. We designed a generalized model using three different datasets, and each utterance was converted into a spectrogram image through STFT preprocessing. The spectrogram was divided along the time axis, with overlap, to match the input size of the model. Each divided section is represented as a Gaussian distribution, and data quality is assessed by the correlation coefficient between distributions; as a result, the scale of the data is reduced and uncertainty is minimized. VGGish and YAMNet are among the most widely used pretrained deep learning networks for speech processing. In speech signal processing, it is often advantageous to use these pretrained models together rather than individually, which leads to the construction of ensemble deep networks. Finally, several explainable techniques (Grad-CAM, LIME, occlusion sensitivity) are used to analyze the classification results. The model adapts to voices recorded in various environments, yielding a classification accuracy of 87%, surpassing that of the individual models. Additionally, the outputs are examined with an explainable model to extract the essential emotional regions, which are converted into audio files for auditory analysis using Grad-CAM in the time domain. Through this study, we reduce the uncertainty of the activation areas generated by Grad-CAM by applying the interpretability findings of previous studies together with effective preprocessing and fusion models, and we analyze the results from a more diverse perspective through other explainable techniques.
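The abstract describes slicing the STFT spectrogram into overlapping time sections, summarizing each section as a Gaussian-style distribution, and filtering sections by the correlation coefficient between distributions. The paper gives no code; the sketch below illustrates that idea under assumed parameters (512-point STFT, 50% section overlap, and a correlation threshold of 0.5 are all illustrative choices, not values from the paper).

```python
import numpy as np
from scipy.signal import stft

def segment_spectrogram(wave, fs=16000, seg_frames=32, hop_frames=16):
    """Compute a log-magnitude spectrogram and slice it into
    overlapping time sections (50% overlap here, an assumed value)."""
    _, _, Z = stft(wave, fs=fs, nperseg=512, noverlap=256)
    spec = np.log(np.abs(Z) + 1e-8)            # log-magnitude spectrogram
    segs = []
    for start in range(0, spec.shape[1] - seg_frames + 1, hop_frames):
        segs.append(spec[:, start:start + seg_frames])
    return segs

def filter_by_correlation(segs, threshold=0.5):
    """Summarize each section by its per-frequency-bin mean (a
    Gaussian-style profile) and keep only sections whose profile
    correlates with the utterance-level mean profile above an
    assumed threshold, discarding uncertain sections."""
    profiles = np.stack([s.mean(axis=1) for s in segs])  # one profile per section
    mean_profile = profiles.mean(axis=0)
    kept = []
    for seg, prof in zip(segs, profiles):
        r = np.corrcoef(prof, mean_profile)[0, 1]        # Pearson correlation
        if r >= threshold:
            kept.append(seg)
    return kept
```

Sections that correlate poorly with the overall distribution are dropped, which matches the abstract's claim that the data scale shrinks while uncertainty is minimized.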
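Of the explainable techniques named in the abstract, occlusion sensitivity is the most model-agnostic: hide one region of the input at a time and measure how much the class score drops. A minimal sketch follows; `score_fn`, the patch size, and the mean-value fill are illustrative assumptions standing in for the paper's actual classifier and settings.

```python
import numpy as np

def occlusion_sensitivity(spec, score_fn, patch=(8, 8)):
    """Slide a masking patch over the spectrogram and record how much
    the class score drops when each region is hidden. `score_fn` is
    any callable mapping a spectrogram to a scalar class score."""
    base = score_fn(spec)
    heat = np.zeros_like(spec, dtype=float)
    ph, pw = patch
    for i in range(0, spec.shape[0], ph):
        for j in range(0, spec.shape[1], pw):
            masked = spec.copy()
            masked[i:i + ph, j:j + pw] = spec.mean()      # occlude region
            heat[i:i + ph, j:j + pw] = base - score_fn(masked)  # score drop
    return heat
```

Regions with a large score drop are the ones the classifier depends on; for SER these would be the time-frequency areas carrying the emotional cues, which the paper then maps back to audio via Grad-CAM in the time domain.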
Pages: 23
Related Papers (50 records)
  • [21] Speech Emotion Recognition Using Deep Learning on audio recordings
    Suganya, S.
    Charles, E. Y. A.
    2019 19TH INTERNATIONAL CONFERENCE ON ADVANCES IN ICT FOR EMERGING REGIONS (ICTER - 2019), 2019,
  • [22] Speech emotion recognition of Hindi speech using statistical and machine learning techniques
    Agrawal, Akshat
    Jain, Anurag
    JOURNAL OF INTERDISCIPLINARY MATHEMATICS, 2020, 23 (01) : 311 - 319
  • [23] Speech emotion recognition for psychotherapy: an analysis of traditional machine learning and deep learning techniques
    Shah, Nidhi
    Sood, Kanika
    Arora, Jayraj
    2023 IEEE 13TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE, CCWC, 2023, : 718 - 723
  • [24] Speech Emotion Recognition by Late Fusion of Linguistic and Acoustic Features using Deep Learning Models
    Sato, Kiyohide
    Kishi, Keita
    Kosaka, Tetsuo
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1013 - 1018
  • [26] Language dialect based speech emotion recognition through deep learning techniques
    Rajendran, Sukumar
    Mathivanan, Sandeep Kumar
    Jayagopal, Prabhu
    Venkatasen, Maheshwari
    Pandi, Thanapal
    Sorakaya Somanathan, Manivannan
    Thangaval, Muthamilselvan
    Mani, Prasanna
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2021, 24 (03) : 625 - 635
  • [27] Emotion Recognition in Speech with Deep Learning Architectures
    Erdal, Mehmet
    Kaechele, Markus
    Schwenker, Friedhelm
    ARTIFICIAL NEURAL NETWORKS IN PATTERN RECOGNITION, 2016, 9896 : 298 - 311
  • [28] Deep Learning for Emotion Recognition on Small Datasets Using Transfer Learning
    Hong-Wei Ng
    Viet Dung Nguyen
    Vonikakis, Vassilios
    Winkler, Stefan
    ICMI'15: PROCEEDINGS OF THE 2015 ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2015, : 443 - 449
  • [29] Speech Emotion Recognition Using Deep Learning LSTM for Tamil Language
    Fernandes, Bennilo
    Mannepalli, Kasiprasad
    PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY, 2021, 29 (03): : 1915 - 1936
  • [30] Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms
    Satt, Aharon
    Rozenberg, Shai
    Hoory, Ron
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1089 - 1093