MaxMViT-MLP: Multiaxis and Multiscale Vision Transformers Fusion Network for Speech Emotion Recognition

Cited by: 1
Authors
Ong, Kah Liang [1]
Lee, Chin Poo [1]
Lim, Heng Siong [2]
Lim, Kian Ming [1]
Alqahtani, Ali [3,4]
Affiliations
[1] Multimedia Univ, Fac Informat Sci & Technol, Melaka 75450, Malaysia
[2] Multimedia Univ, Fac Engn & Technol, Melaka 75450, Malaysia
[3] King Khalid Univ, Dept Comp Sci, Abha 61421, Saudi Arabia
[4] King Khalid Univ, Ctr Artificial Intelligence CAI, Abha 61421, Saudi Arabia
Keywords
Speech recognition; Emotion recognition; Spectrogram; Feature extraction; Support vector machines; Transformers; Mel frequency cepstral coefficient; Ensemble learning; Visualization; Speech emotion recognition; ensemble learning; spectrogram; vision transformer; Emo-DB; RAVDESS; IEMOCAP;
DOI
10.1109/ACCESS.2024.3360483
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Vision Transformers, known for their innovative architectural design and modeling capabilities, have gained significant attention in computer vision. This paper presents a dual-path approach that leverages the strengths of the Multi-Axis Vision Transformer (MaxViT) and the Improved Multiscale Vision Transformer (MViTv2). It starts by encoding speech signals into Constant-Q Transform (CQT) spectrograms and Mel Spectrograms with Short-Time Fourier Transform (Mel-STFT). The CQT spectrogram is then fed into the MaxViT model, while the Mel-STFT is input to the MViTv2 model, to extract informative features from the spectrograms. These features are integrated and passed into a Multilayer Perceptron (MLP) model for final classification. This hybrid model is named the "MaxViT and MViTv2 Fusion Network with Multilayer Perceptron (MaxMViT-MLP)." The MaxMViT-MLP model achieves remarkable results with an accuracy of 95.28% on the Emo-DB dataset, 89.12% on the RAVDESS dataset, and 68.39% on the IEMOCAP dataset, substantiating the advantages of integrating multiple audio feature representations and Vision Transformers in speech emotion recognition.
Pages: 18237 - 18250 (14 pages)
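The abstract describes a dual-path architecture: CQT spectrograms are fed to MaxViT, Mel-STFT spectrograms to MViTv2, the two feature vectors are fused, and an MLP performs the final emotion classification. The minimal sketch below illustrates that flow with off-the-shelf components; it is not the authors' implementation. The input file name (speech.wav), the timm backbone variants (maxvit_tiny_tf_224, mvitv2_tiny), the 224x224 input size, the 256-unit MLP, and the 7-class output are illustrative assumptions, and input normalization and training are omitted for brevity.

```python
# Sketch of a dual-path spectrogram-fusion pipeline (illustrative, not the paper's code).
import librosa
import numpy as np
import timm
import torch
import torch.nn as nn


def spectrogram_to_tensor(spec_db, size=224):
    """Scale a dB spectrogram to [0, 1], resize, and tile to 3 channels for a ViT backbone."""
    spec = (spec_db - spec_db.min()) / (spec_db.max() - spec_db.min() + 1e-8)
    img = torch.tensor(spec, dtype=torch.float32)[None, None]      # (1, 1, freq, time)
    img = nn.functional.interpolate(img, size=(size, size), mode="bilinear")
    return img.repeat(1, 3, 1, 1)                                  # (1, 3, 224, 224)


# Hypothetical input audio file.
y, sr = librosa.load("speech.wav", sr=16000)

# Two audio representations: CQT spectrogram and Mel spectrogram (STFT-based).
cqt_db = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr)), ref=np.max)
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)

# Vision Transformer backbones as feature extractors (num_classes=0 returns pooled features).
# The exact MaxViT/MViTv2 variants used in the paper may differ from these timm models.
maxvit = timm.create_model("maxvit_tiny_tf_224", pretrained=True, num_classes=0).eval()
mvitv2 = timm.create_model("mvitv2_tiny", pretrained=True, num_classes=0).eval()

with torch.no_grad():
    f_cqt = maxvit(spectrogram_to_tensor(cqt_db))                  # CQT path -> MaxViT features
    f_mel = mvitv2(spectrogram_to_tensor(mel_db))                  # Mel-STFT path -> MViTv2 features

# Feature-level fusion followed by an MLP classifier (e.g., 7 emotion classes on Emo-DB).
fused = torch.cat([f_cqt, f_mel], dim=1)
mlp = nn.Sequential(nn.Linear(fused.shape[1], 256), nn.ReLU(), nn.Linear(256, 7))
logits = mlp(fused)
```

Concatenating the two pooled feature vectors before the MLP mirrors the feature-level fusion described in the abstract; in practice the backbones would be fine-tuned on the spectrogram images rather than used as frozen extractors.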