Bimodal variational autoencoder for audiovisual speech recognition

Authors
Hadeer M. Sayed
Hesham E. ElDeeb
Shereen A. Taie
Affiliations
[1] Fayoum University, Department of Computer Science
[2] Electronics Research Institute, Department of Computer and Control
Source
Machine Learning | 2023 / Vol. 112
Keywords
Multimodal data fusion; Audiovisual speech recognition; Deep learning; Variational autoencoder; Cross-modality
DOI
Not available
Abstract
Multimodal fusion combines information from multiple modalities into a joint representation, with the goal of improving accuracy on classification or regression tasks. This paper proposes a Bimodal Variational Autoencoder (BiVAE) model for audiovisual feature fusion. Relying on audiovisual signals in a speech recognition task increases recognition accuracy, especially when the audio signal is corrupted. The BiVAE model is trained and validated on the CUAVE dataset. Three classifiers are used to evaluate the fused audiovisual features: Long Short-Term Memory (LSTM), Deep Neural Network (DNN), and Support Vector Machine (SVM). The experiments evaluate the fused features both when the two modalities are available and when only one modality is available (i.e., cross-modality). The experimental results show that the proposed BiVAE model outperforms state-of-the-art audiovisual feature fusion models by an average accuracy difference of $\simeq$ 3.28% and 13.28% under clean and noisy conditions, respectively. Additionally, BiVAE outperforms the state-of-the-art models in the cross-modality case by an accuracy difference of $\simeq$ 2.79% when only the audio signal is available and 1.88% when only the video signal is available. Furthermore, SVM achieves the best recognition accuracy among the three classifiers.
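As a rough illustration of the idea in the abstract, the following minimal sketch shows one way a bimodal VAE could fuse audio and video features into a shared latent code. It is not the authors' architecture: the feature sizes, the single hidden layer, the concatenation-based encoder, the squared-error reconstruction term, and the zero-filling of a missing modality are all assumptions made for illustration, and the weights are random (untrained).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes; the abstract does not specify them.
AUDIO_DIM, VIDEO_DIM, HIDDEN_DIM, LATENT_DIM = 39, 64, 32, 8


def init_layer(n_in, n_out):
    """Return (weights, bias) for one dense layer with small random weights."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def dense(x, layer):
    """Apply an affine layer to a batch of row vectors."""
    w, b = layer
    return x @ w + b


# Untrained parameters: one shared encoder, one decoder head per modality.
params = {
    "enc": init_layer(AUDIO_DIM + VIDEO_DIM, HIDDEN_DIM),
    "mu": init_layer(HIDDEN_DIM, LATENT_DIM),
    "logvar": init_layer(HIDDEN_DIM, LATENT_DIM),
    "dec": init_layer(LATENT_DIM, HIDDEN_DIM),
    "out_audio": init_layer(HIDDEN_DIM, AUDIO_DIM),
    "out_video": init_layer(HIDDEN_DIM, VIDEO_DIM),
}


def encode(audio, video):
    """Map concatenated audio and video features to posterior mean and log-variance."""
    h = np.tanh(dense(np.concatenate([audio, video], axis=1), params["enc"]))
    return dense(h, params["mu"]), dense(h, params["logvar"])


def decode(z):
    """Reconstruct both modalities from the shared latent code."""
    h = np.tanh(dense(z, params["dec"]))
    return dense(h, params["out_audio"]), dense(h, params["out_video"])


def negative_elbo(audio, video, target_audio, target_video):
    """Squared-error reconstruction of both modalities plus KL to a standard normal prior."""
    mu, logvar = encode(audio, video)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterization
    rec_audio, rec_video = decode(z)
    recon = (np.sum((rec_audio - target_audio) ** 2, axis=1)
             + np.sum((rec_video - target_video) ** 2, axis=1))
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return float(np.mean(recon + kl))


batch = 4
audio = rng.standard_normal((batch, AUDIO_DIM))
video = rng.standard_normal((batch, VIDEO_DIM))

# Fused feature: here the posterior mean is taken as the joint representation.
fused, _ = encode(audio, video)

# Cross-modality (assumption): feed zeros for the missing video stream.
fused_audio_only, _ = encode(audio, np.zeros_like(video))

loss = negative_elbo(audio, video, audio, video)
```

In a setup like this, the fused vectors would be passed to a downstream classifier such as an SVM, DNN, or LSTM, which is the evaluation role those classifiers play in the abstract.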
Pages: 1201-1226
Page count: 25