An Arabic visual speech recognition framework with CNN and vision transformers for lipreading

Cited by: 0
Authors
Baaloul, Ali [1]
Benblidia, Nadjia [1]
Reguieg, Fatma Zohra [1]
Bouakkaz, Mustapha [2]
Felouat, Hisham [1]
Affiliations
[1] Blida1 Univ, Fac Sci, LRDSI Lab, Blida, Algeria
[2] Laghouat Univ, LIM Lab, Laghouat, Algeria
Keywords
Lip-readings; Visual speech recognition; Audiovisual dataset; CNN; Vision transformer; INTEGRATION
DOI
10.1007/s11042-024-18237-5
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Individuals with hearing impairments often rely on non-verbal communication, including facial expressions and gestures. Systems for Visual Speech Recognition (VSR) face challenges due to insufficient datasets and the complexity of extracting nuanced lip movements. In response, we aim to provide a two-fold framework, BlidAVS10. Firstly, we concentrate on the creation of a robust Arabic audio-visual dataset comprising 1,383 videos. Secondly, we introduce an approach to Arabic Audio-Visual Speech Recognition that leverages BlidAVS10 for the development of various VSR systems. BlidAVS10 includes four key services: (1) the creation of a comprehensive dataset through video generation, (2) the detection, tracking, and extraction of the mouth region within each video frame, (3) the selection and customization of VSR models by developers, and (4) the building, training, and evaluation of our Deep Learning (DL) models, featuring a multi-layer Convolutional Neural Network (CNN) model and a vision transformer (ViT). Our extensive experiments on BlidAVS10 demonstrate the effectiveness and reliability of our recognition techniques under varying environmental conditions. The DL-based VSR systems achieved an accuracy of nearly 98% on the dataset. This work introduces BlidAVS10, an audio-visual database, and offers a versatile framework with potential applications in assisting the hard of hearing, securing access through lipreading, enabling soundless communication with machines, and supporting the medical field in understanding the needs of laryngeal cancer patients.
Pages: 69989-70023
Page count: 35