On the Effect of Log-Mel Spectrogram Parameter Tuning for Deep Learning-Based Speech Emotion Recognition

Cited by: 1
Authors
Mukhamediya, Azamat [1 ]
Fazli, Siamac [2 ]
Zollanvari, Amin [1 ]
Affiliations
[1] Nazarbayev Univ, Sch Engn & Digital Sci, Dept Elect & Comp Engn, Astana 010000, Kazakhstan
[2] Nazarbayev Univ, Sch Engn & Digital Sci, Dept Comp Sci, Astana 010000, Kazakhstan
Keywords
Log-Mel spectrogram; speech emotion recognition; SqueezeNet; neural networks
DOI
10.1109/ACCESS.2023.3287093
Chinese Library Classification (CLC) Code
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Speech emotion recognition (SER) has become a major area of investigation in human-computer interaction. Conventionally, SER is formulated as a classification problem that follows a common methodology: (i) extracting features from speech signals; and (ii) constructing an emotion classifier using the extracted features. With the advent of deep learning, however, the former stage is integrated into the latter. That is to say, deep neural networks (DNNs), trained on log-Mel spectrograms (LMS) of audio waveforms, extract discriminative features from the LMS themselves. A critical and often overlooked issue is that this procedure is carried out without relating the choice of LMS parameters to the performance of the trained DNN classifiers. It is commonplace in SER studies for practitioners to assume some "usual" values for these parameters and devote the major effort to training and comparing various DNN architectures. In contrast with this common approach, in this work we choose a single lightweight pre-trained architecture, namely SqueezeNet, and shift our main effort to tuning the LMS parameters. Our empirical results on three publicly available SER datasets show that: (i) the LMS parameters can considerably affect the performance of DNNs; and (ii) tuning the LMS parameters yields highly competitive classification performance. In particular, treating the LMS parameters as hyperparameters and tuning them led to improvements of approximately 23%, 10%, and 11% over the "usual" LMS parameter values on the EmoDB, IEMOCAP, and SAVEE datasets, respectively.
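To make the described pipeline concrete, the sketch below illustrates the general idea of treating log-Mel spectrogram parameters (window length, hop length, and number of Mel bands) as hyperparameters feeding a pre-trained SqueezeNet image classifier. This is a minimal illustration assuming librosa and PyTorch/torchvision as tooling; the grid values, helper names, and input preprocessing are illustrative placeholders rather than the authors' actual configuration, and the actual method would re-train SqueezeNet's classifier head on emotion labels and select the LMS parameters by validation performance.

import itertools

import librosa
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.models import squeezenet1_1


def log_mel_spectrogram(y, sr, n_fft, hop_length, n_mels):
    """Log-Mel spectrogram in dB; n_fft, hop_length, and n_mels are the tunable parameters."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)


def to_squeezenet_input(lms):
    """Min-max scale the LMS, replicate it to 3 channels, and resize to 224x224."""
    x = (lms - lms.min()) / (lms.max() - lms.min() + 1e-8)
    x = torch.from_numpy(x).float()[None, None]               # (1, 1, n_mels, frames)
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    return x.repeat(1, 3, 1, 1)                                # (1, 3, 224, 224)


if __name__ == "__main__":
    sr = 16000
    y = np.random.randn(3 * sr).astype(np.float32)             # stand-in for a speech clip

    # weights=None keeps the sketch offline; in practice the ImageNet-pretrained
    # backbone would be loaded and its classifier head re-trained on emotion labels.
    model = squeezenet1_1(weights=None).eval()

    # Treat the LMS parameters as hyperparameters and sweep a (hypothetical) grid.
    grid = itertools.product([512, 1024, 2048],   # n_fft (window length)
                             [128, 256, 512],     # hop_length
                             [64, 128])           # n_mels
    for n_fft, hop, n_mels in grid:
        lms = log_mel_spectrogram(y, sr, n_fft, hop, n_mels)
        with torch.no_grad():
            logits = model(to_squeezenet_input(lms))
        print(f"n_fft={n_fft:4d} hop={hop:3d} n_mels={n_mels:3d} "
              f"-> LMS {lms.shape}, output {tuple(logits.shape)}")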
Pages: 61950-61957
Number of pages: 8
Related Papers
50 records in total
  • [1] Meng, Hao; Yan, Tianhao; Yuan, Fei; Wei, Hongwei. Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network. IEEE ACCESS, 2019, 7: 125868-125881.
  • [2] Nguyen, Minh Tuan; Lin, Wei Wen; Huang, Jin H. Heart Sound Classification Using Deep Learning Techniques Based on Log-mel Spectrogram. CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42(1): 344-360.
  • [3] Meghanani, Amit; Anoop, C. S.; Ramakrishnan, A. G. An Exploration of Log-Mel Spectrogram and MFCC Features for Alzheimer's Dementia Recognition From Spontaneous Speech. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021: 670-677.
  • [4] Yetkin, Ahmet Kemal; Kose, Hatice. Speech-Based Emotion Analysis Using Log-Mel Spectrograms and MFCC Features. 2023 31ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2023.
  • [5] Xu, Hai-tao; Zhang, Jie; Dai, Li-rong. Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry Recognition. INTERSPEECH 2022, 2022: 1963-1967.
  • [6] Li, Hui; Li, Jiawen; Liu, Hai; Liu, Tingting; Chen, Qiang; You, Xinge. MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers. SENSORS, 2024, 24(17).
  • [7] Gao, Dongrui; Tang, Xue; Wan, Manqing; Huang, Guo; Zhang, Yongqing. EEG Driving Fatigue Detection Based on Log-Mel Spectrogram and Convolutional Recurrent Neural Networks. FRONTIERS IN NEUROSCIENCE, 2023, 17.
  • [8] Uddin, Mohammad Amaz; Chowdury, Mohammad Salah Uddin; Khandaker, Mayeen Uddin; Tamam, Nissren; Sulieman, Abdelmoneim. The Efficacy of Deep Learning-Based Mixed Model for Speech Emotion Recognition. CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 74(1): 1709-1722.
  • [9] Lian, Hailun; Lu, Cheng; Li, Sunan; Zhao, Yan; Tang, Chuangao; Zong, Yuan. A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face. ENTROPY, 2023, 25(10).