Bimodal variational autoencoder for audiovisual speech recognition

Authors
Hadeer M. Sayed
Hesham E. ElDeeb
Shereen A. Taie
Affiliations
[1] Fayoum University, Department of Computer Science
[2] Electronics Research Institute, Department of Computer and Control
Source
Machine Learning | 2023 / Vol. 112, No. 4
Keywords
Multimodal data fusion; Audiovisual speech recognition; Deep learning; Variational autoencoder; Cross-modality
DOI
Not available
Abstract
Multimodal fusion combines information from multiple modalities in a joint representation, with the goal of improving the accuracy of classification or regression tasks. This paper proposes a Bimodal Variational Autoencoder (BiVAE) model for audiovisual feature fusion. Relying on both audio and visual signals in a speech recognition task increases recognition accuracy, especially when the audio signal is corrupted. The BiVAE model is trained and validated on the CUAVE dataset. The fused audiovisual features are evaluated with three classifiers: Long Short-Term Memory (LSTM), Deep Neural Network (DNN), and Support Vector Machine (SVM). The experiments evaluate the fused features both when the two modalities are available and when only one modality is available (i.e., cross-modality). The experimental results show that the proposed BiVAE model outperforms state-of-the-art audiovisual feature fusion models by an average accuracy difference of ≃ 3.28% and 13.28% under clean and noisy conditions, respectively. Additionally, BiVAE outperforms the state-of-the-art models in the cross-modality case by an accuracy difference of ≃ 2.79% when only the audio signal is available and 1.88% when only the video signal is available. Furthermore, SVM achieves the best recognition accuracy among the three classifiers.
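The abstract does not describe the BiVAE architecture, so the following is only a minimal sketch of the general idea it names: a variational autoencoder that encodes concatenated audio and video features into one joint latent vector, and that can still produce that vector when one modality is missing. Everything in it is an assumption made for illustration, not the authors' model: the class name `BimodalVAESketch`, the single linear layer per mapping, the zero-filling of a missing modality, the unit-variance Gaussian decoder, and the feature dimensions. It runs a forward pass and computes the loss only; no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def init_linear(n_in, n_out):
    """Return small random weights and a zero bias for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def kl_to_standard_normal(mu, logvar):
    """Per-sample KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)


class BimodalVAESketch:
    """Hypothetical single-layer bimodal VAE; not the paper's architecture."""

    def __init__(self, audio_dim, video_dim, latent_dim):
        self.audio_dim, self.video_dim = audio_dim, video_dim
        joint_dim = audio_dim + video_dim
        self.w_mu, self.b_mu = init_linear(joint_dim, latent_dim)
        self.w_logvar, self.b_logvar = init_linear(joint_dim, latent_dim)
        self.w_dec, self.b_dec = init_linear(latent_dim, joint_dim)

    def encode(self, audio=None, video=None):
        """Map the available modalities to the latent mean and log-variance."""
        if audio is None and video is None:
            raise ValueError("at least one modality is required")
        batch = len(audio) if audio is not None else len(video)
        # Assumption: a missing modality is replaced by zeros (cross-modality).
        if audio is None:
            audio = np.zeros((batch, self.audio_dim))
        if video is None:
            video = np.zeros((batch, self.video_dim))
        x = np.concatenate([audio, video], axis=1)
        return x @ self.w_mu + self.b_mu, x @ self.w_logvar + self.b_logvar

    def sample(self, mu, logvar):
        """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
        return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

    def decode(self, z):
        """Reconstruct the concatenated audio and video features."""
        return z @ self.w_dec + self.b_dec

    def loss(self, target_audio, target_video, audio=None, video=None):
        """Batch-mean negative ELBO, up to an additive constant, assuming a
        unit-variance Gaussian decoder (half squared error plus KL)."""
        mu, logvar = self.encode(audio, video)
        recon = self.decode(self.sample(mu, logvar))
        target = np.concatenate([target_audio, target_video], axis=1)
        recon_err = 0.5 * np.sum((recon - target) ** 2, axis=1)
        return float(np.mean(recon_err + kl_to_standard_normal(mu, logvar)))
```

In a setup like the one the abstract outlines, the latent mean returned by `encode` would serve as the fused feature vector passed to a downstream classifier such as an SVM. Calling `loss(a, v, audio=a)` asks the model to reconstruct both modalities from audio alone, which is one plausible reading of the cross-modality case.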
Pages: 1201-1226
Page count: 25