An Arabic visual speech recognition framework with CNN and vision transformers for lipreading

Authors
Baaloul, Ali [1]
Benblidia, Nadjia [1]
Reguieg, Fatma Zohra [1]
Bouakkaz, Mustapha [2]
Felouat, Hisham [1]
Affiliations
[1] Blida1 Univ, Fac Sci, LRDSI Lab, Blida, Algeria
[2] Laghouat Univ, LIM Lab, Laghouat, Algeria
Keywords
Lip-readings; Visual speech recognition; Audiovisual dataset; CNN; Vision transformer; INTEGRATION;
DOI
10.1007/s11042-024-18237-5
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Individuals with hearing impairments often rely on non-verbal communication, including facial expressions and gestures. Systems for Visual Speech Recognition (VSR) face challenges due to insufficient datasets and the complexity of extracting nuanced lip movements. In response, we propose a two-fold framework, BlidAVS10. First, we create a robust Arabic audio-visual dataset comprising 1,383 videos. Second, we introduce an approach to Arabic Audio-Visual Speech Recognition that leverages BlidAVS10 to develop various VSR systems. BlidAVS10 includes four key services: (1) the creation of a comprehensive dataset through video generation, (2) the detection, tracking, and extraction of the mouth region within each video frame, (3) the selection and customization of VSR models by developers, and (4) the building, training, and evaluation of our Deep Learning (DL) models, featuring a multi-layer Convolutional Neural Network (CNN) model and a vision transformer (ViT). Our extensive experiments on BlidAVS10 demonstrate the effectiveness and reliability of our recognition techniques under varying environmental conditions. The DL-based VSR systems trained on the dataset achieved an accuracy of nearly 98%. This work introduces BlidAVS10, a new audio-visual database, and offers a versatile framework with potential applications in assisting the hard of hearing, securing access through lipreading, enabling soundless communication with machines, and supporting the medical field in understanding the needs of laryngeal cancer patients.
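As a rough illustration of services (2) and (4), the sketch below crops a mouth region from one video frame and splits it into the flattened, non-overlapping patches that a vision transformer typically takes as input tokens. It is not the authors' implementation: the function names `crop_mouth` and `to_patches`, the bounding box, the frame size, and the 16-pixel patch size are all assumptions made for illustration, and the detection step that would produce the box is not shown.

```python
import numpy as np

def crop_mouth(frame, box):
    """Crop a (top, left, height, width) box from an H x W x C frame.

    The box is assumed to come from some face/mouth detector (hypothetical here).
    """
    top, left, height, width = box
    return frame[top:top + height, left:left + width]

def to_patches(roi, patch):
    """Split an H x W x C region into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), ordered
    row by row, which is the usual input layout for a ViT patch embedding.
    """
    h, w, c = roi.shape
    if h % patch or w % patch:
        raise ValueError("region height and width must be multiples of the patch size")
    grid = roi.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

# Synthetic frame standing in for one decoded video frame (assumed size).
frame = np.arange(120 * 160 * 3, dtype=np.int64).reshape(120, 160, 3)
roi = crop_mouth(frame, (60, 48, 32, 64))  # assumed mouth bounding box
tokens = to_patches(roi, 16)               # assumed 16 x 16 patches
```

With these assumed sizes, a 32 x 64 mouth region yields a 2 x 4 grid of patches; in a full model each row of `tokens` would then pass through a learned linear projection before the transformer layers.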
Pages: 69989-70023
Page count: 35