On the Effect of Log-Mel Spectrogram Parameter Tuning for Deep Learning-Based Speech Emotion Recognition

Cited: 3
Authors
Mukhamediya, Azamat [1 ]
Fazli, Siamac [2 ]
Zollanvari, Amin [1 ]
Affiliations
[1] Nazarbayev Univ, Sch Engn & Digital Sci, Dept Elect & Comp Engn, Astana 010000, Kazakhstan
[2] Nazarbayev Univ, Sch Engn & Digital Sci, Dept Comp Sci, Astana 010000, Kazakhstan
Keywords
Log-Mel spectrogram; speech emotion recognition; SqueezeNet; neural networks
DOI
10.1109/ACCESS.2023.3287093
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline code
0812
Abstract
Speech emotion recognition (SER) has become a major area of investigation in human-computer interaction. Conventionally, SER is formulated as a classification problem that follows a common methodology: (i) extracting features from speech signals; and (ii) constructing an emotion classifier using the extracted features. With the advent of deep learning, however, the former stage is integrated into the latter. That is to say, deep neural networks (DNNs), which are trained using log-Mel spectrograms (LMS) of audio waveforms, extract discriminative features from LMS. A critical issue, and one that is often overlooked, is that this procedure is done without relating the choice of LMS parameters to the performance of the trained DNN classifiers. It is commonplace in SER studies that practitioners assume some "usual" values for these parameters and devote major efforts to training and comparing various DNN architectures. In contrast with this common approach, in this work we choose a single lightweight pre-trained architecture, namely, SqueezeNet, and shift our main effort into tuning LMS parameters. Our empirical results using three publicly available SER datasets show that: (i) the parameters of LMS can considerably affect the performance of DNNs; and (ii) by tuning LMS parameters, highly competitive classification performance can be achieved. In particular, treating LMS parameters as hyperparameters and tuning them led to ~23%, ~10%, and ~11% improvements over the "usual" values of LMS parameters on the EmoDB, IEMOCAP, and SAVEE datasets, respectively.
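To make concrete which quantities the abstract treats as hyperparameters, the following is a minimal pure-NumPy sketch of log-Mel spectrogram extraction. The parameter names `n_fft` (FFT/window size), `hop` (frame step), and `n_mels` (number of Mel bands) are illustrative assumptions on our part, not the paper's exact pipeline or values; libraries such as librosa expose equivalent knobs.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style Hz-to-Mel mapping.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the Mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, cen, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, cen):          # rising slope
            fb[i, k] = (k - lo) / max(cen - lo, 1)
        for k in range(cen, hi):          # falling slope
            fb[i, k] = (hi - k) / max(hi - cen, 1)
    return fb

def log_mel_spectrogram(y, sr, n_fft=1024, hop=256, n_mels=64):
    # n_fft, hop, and n_mels are the kind of LMS parameters the paper tunes.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2     # (frames, n_fft//2+1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T     # (frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, 1e-10))        # log compression in dB

# One second of a 440 Hz tone as a toy input.
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)
S = log_mel_spectrogram(y, sr, n_fft=1024, hop=256, n_mels=64)
print(S.shape)
```

Changing `n_fft`, `hop`, or `n_mels` changes the time-frequency resolution and the shape of the resulting LMS "image" fed to the DNN, which is why tuning them can shift classifier performance.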
Pages: 61950-61957
Page count: 8
Related papers
50 in total
  • [11] A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face
    Lian, Hailun
    Lu, Cheng
    Li, Sunan
    Zhao, Yan
    Tang, Chuangao
    Zong, Yuan
    ENTROPY, 2023, 25 (10)
  • [12] Deep Learning-Based Emotion Recognition by Fusion of Facial Expressions and Speech Features
    Vardhan, Jasthi Vivek
    Chakravarti, Yelavarti Kalyan
    Chand, Annam Jitin
    2024 2ND WORLD CONFERENCE ON COMMUNICATION & COMPUTING, WCONF 2024, 2024,
  • [13] Deep transfer learning-based bird species classification using mel spectrogram images
    Baowaly, Mrinal Kanti
    Sarkar, Bisnu Chandra
    Walid, Md. Abul Ala
    Ahamad, Md. Martuza
    Singh, Bikash Chandra
    Alvarado, Eduardo Silva
    Ashraf, Imran
    Samad, Md. Abdus
    PLOS ONE, 2024, 19 (08)
  • [14] Multi-Input Speech Emotion Recognition Model Using Mel Spectrogram and GeMAPS
    Toyoshima, Itsuki
    Okada, Yoshifumi
    Ishimaru, Momoko
    Uchiyama, Ryunosuke
    Tada, Mayu
    SENSORS, 2023, 23 (03)
  • [15] Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers
    Ong, Kah Liang
    Lee, Chin Poo
    Lim, Heng Siong
    Lim, Kian Ming
    Alqahtani, Ali
    IEEE ACCESS, 2023, 11 : 108571 - 108579
  • [16] Bi-Feature Selection Deep Learning-Based Techniques for Speech Emotion Recognition
    Akinpelu, Samson
    Viriri, Serestina
    ADVANCES IN VISUAL COMPUTING, ISVC 2024, PT I, 2025, 15046 : 345 - 356
  • [17] Emotion recognition based on AlexNet using speech spectrogram
    Park, Soeun
    Lee, Chul
    Kwon, Soonil
    Park, Neungsoo
    BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY, 2018, 123 : 49 - 49
  • [18] Speech Emotion Recognition with Deep Learning
    Harar, Pavol
    Burget, Radim
    Dutta, Malay Kishore
    2017 4TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND INTEGRATED NETWORKS (SPIN), 2017, : 137 - 140
  • [19] Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition
    Lambamo, Wondimu
    Srinivasagan, Ramasamy
    Jifara, Worku
    APPLIED SCIENCES-BASEL, 2023, 13 (01)
  • [20] Deep Learning-based Telephony Speech Recognition in the Wild
    Han, Kyu J.
    Hahm, Seongjun
    Kim, Byung-Hak
    Kim, Jungsuk
    Lane, Ian
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1323 - 1327