An arabic visual speech recognition framework with CNN and vision transformers for lipreading

Cited by: 0
Authors
Baaloul, Ali [1]
Benblidia, Nadjia [1]
Reguieg, Fatma Zohra [1]
Bouakkaz, Mustapha [2]
Felouat, Hisham [1]
Affiliations
[1] Blida1 Univ, Fac Sci, LRDSI Lab, Blida, Algeria
[2] Laghouat Univ, LIM Lab, Laghouat, Algeria
Keywords
Lip-readings; Visual speech recognition; Audiovisual dataset; CNN; Vision transformer; INTEGRATION;
DOI
10.1007/s11042-024-18237-5
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Individuals with hearing impairments often rely on non-verbal communication, including facial expressions and gestures. Systems for Visual Speech Recognition (VSR) face challenges due to insufficient datasets and the complexity of extracting nuanced lip movements. In response, we propose a two-fold framework, BlidAVS10. First, we create a robust Arabic audio-visual dataset comprising 1,383 videos. Second, we introduce an approach to Arabic Audio-Visual Speech Recognition that uses BlidAVS10 to develop various VSR systems. BlidAVS10 includes four key services: (1) the creation of a comprehensive dataset through video generation, (2) the detection, tracking, and extraction of the mouth region within each video frame, (3) the selection and customization of VSR models by developers, and (4) the building, training, and evaluation of our Deep Learning (DL) models, featuring a multi-layer Convolutional Neural Network (CNN) model and a Vision Transformer (ViT). Our extensive experiments on BlidAVS10 indicate that our recognition techniques are effective and reliable under varying environmental conditions. The DL-based VSR systems achieved an accuracy of nearly 98% on the dataset. This work introduces BlidAVS10, an audio-visual database, and offers a versatile framework with potential applications in assisting the hard of hearing, securing access through lipreading, enabling soundless communication with machines, and supporting the medical field in understanding the needs of laryngeal cancer patients.
Pages: 69989-70023
Page count: 35