Bimodal variational autoencoder for audiovisual speech recognition

Cited by: 0

Authors:

Hadeer M. Sayed

Hesham E. ElDeeb

Shereen A. Taie

Affiliations:

[1] Fayoum University, Department of Computer Science

[2] Electronics Research Institute, Department of Computer and Control

Source:

Machine Learning | 2023 / Vol. 112

Keywords:

Multimodal data fusion; Audiovisual speech recognition; Deep learning; Variational autoencoder; Cross-modality;

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Multimodal fusion combines information from multiple modalities into a joint representation. Its goal is to improve the accuracy of classification or regression tasks. This paper proposes a Bimodal Variational Autoencoder (BiVAE) model for audiovisual feature fusion. Relying on both audio and visual signals in a speech recognition task increases recognition accuracy, especially when the audio signal is corrupted. The BiVAE model is trained and validated on the CUAVE dataset. The fused audiovisual features are evaluated with three classifiers: Long Short-Term Memory (LSTM), Deep Neural Network (DNN), and Support Vector Machine (SVM). The experiments evaluate the fused features both when the two modalities are available and when only one modality is available (i.e., cross-modality). The experimental results show that the proposed BiVAE model outperforms the state-of-the-art audiovisual feature fusion models by an average accuracy difference of ≃ 3.28% and 13.28% for clean and noisy conditions, respectively. Additionally, BiVAE outperforms the state-of-the-art models in the cross-modality case by an accuracy difference of ≃ 2.79% when only the audio signal is available and 1.88% when only the video signal is available. Furthermore, SVM achieves the best recognition accuracy among the three classifiers.
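The abstract does not describe the BiVAE architecture, so the following is only a generic illustration of the idea it names: a variational autoencoder with one shared latent code for two modalities, where the latent mean serves as the fused feature vector. Every detail here is an assumption made for the sketch and not the authors' design: the feature sizes, the single linear layers, the squared-error reconstruction term, and the choice to zero-fill a missing modality in the cross-modality case. The weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not give the real feature dimensions.
AUDIO_DIM, VIDEO_DIM, LATENT_DIM = 8, 12, 4

def init_linear(n_in, n_out):
    """Small random weights and a zero bias for one linear layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

# One shared encoder over the concatenated input, one decoder head per modality.
W_mu, b_mu = init_linear(AUDIO_DIM + VIDEO_DIM, LATENT_DIM)
W_lv, b_lv = init_linear(AUDIO_DIM + VIDEO_DIM, LATENT_DIM)
W_a, b_a = init_linear(LATENT_DIM, AUDIO_DIM)
W_v, b_v = init_linear(LATENT_DIM, VIDEO_DIM)

def encode(audio, video):
    """Map both modalities to the mean and log-variance of q(z | audio, video)."""
    x = np.concatenate([audio, video], axis=1)
    return x @ W_mu + b_mu, x @ W_lv + b_lv

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

def negative_elbo(audio, video):
    """Squared-error reconstruction of both modalities plus the KL term."""
    mu, logvar = encode(audio, video)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
    audio_hat = z @ W_a + b_a
    video_hat = z @ W_v + b_v
    recon = (np.sum((audio - audio_hat) ** 2, axis=1)
             + np.sum((video - video_hat) ** 2, axis=1))
    return np.mean(recon + kl_to_standard_normal(mu, logvar))

def fused_features(audio, video):
    """Use the latent mean as the fused representation for a classifier."""
    mu, _ = encode(audio, video)
    return mu

audio = rng.standard_normal((5, AUDIO_DIM))
video = rng.standard_normal((5, VIDEO_DIM))

both = fused_features(audio, video)
# Cross-modality (assumed handling): zero-fill the missing video input.
audio_only = fused_features(audio, np.zeros_like(video))
loss = negative_elbo(audio, video)
```

In a setup like this, the rows of `both` or `audio_only` would be passed to a downstream classifier such as an SVM after the model has been trained by minimizing `negative_elbo`.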

Citation

Pages: 1201-1226

Page count: 25

50 items in total

[1] Bimodal variational autoencoder for audiovisual speech recognition
Sayed, Hadeer M.
ElDeeb, Hesham E.
Taie, Shereen A.
[J]. MACHINE LEARNING, 2023, 112 (04) : 1201 - 1226
[2] Automatic Bimodal Audiovisual Speech Recognition: A Review
Kandagal, Amaresh P.
Udayashankara, V.
[J]. 2014 INTERNATIONAL CONFERENCE ON CONTEMPORARY COMPUTING AND INFORMATICS (IC3I), 2014, : 940 - 945
[3] A multimodal dynamical variational autoencoder for audiovisual speech representation learning
Sadok, Samir
Leglaive, Simon
Girin, Laurent
Alameda-Pineda, Xavier
Seguier, Renaud
[J]. NEURAL NETWORKS, 2024, 172
[4] Whisper Speech Enhancement Using Joint Variational Autoencoder for Improved Speech Recognition
Agrawal, Vikas
Kumar, Shashi
Rath, Shakti P.
[J]. INTERSPEECH 2021, 2021, : 2706 - 2710
[5] Bimodal fusion of visual and speech data for audiovisual speaker recognition in noisy environment
Chelali F.Z.
[J]. International Journal of Information Technology, 2023, 15 (6) : 3135 - 3145
[6] ALIGNING AUDIOVISUAL FEATURES FOR AUDIOVISUAL SPEECH RECOGNITION
Tao, Fei
Busso, Carlos
[J]. 2018 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2018,
[7] A RECURRENT VARIATIONAL AUTOENCODER FOR SPEECH ENHANCEMENT
Leglaive, Simon
Alameda-Pineda, Xavier
Girin, Laurent
Horaud, Radu
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 371 - 375
[8] Variational Autoencoder with Global- and Medium Timescale Auxiliaries for Emotion Recognition from Speech
Almotlak, Hussam
Weber, Cornelius
Qu, Leyuan
Wermter, Stefan
[J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2020, PT I, 2020, 12396 : 529 - 540
[9] Speech Enhancement Using Dynamical Variational AutoEncoder
Do, Hao D.
[J]. INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2023, PT II, 2023, 13996 : 247 - 258
[10] A Disentangled Recurrent Variational Autoencoder for Speech Enhancement
Yan, Hegen
Lu, Zhihua
[J]. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1697 - 1702
