An Arabic visual speech recognition framework with CNN and vision transformers for lipreading

Authors
Baaloul, Ali [1]
Benblidia, Nadjia [1]
Reguieg, Fatma Zohra [1]
Bouakkaz, Mustapha [2]
Felouat, Hisham [1]
Affiliations
[1] Blida1 Univ, Fac Sci, LRDSI Lab, Blida, Algeria
[2] Laghouat Univ, LIM Lab, Laghouat, Algeria
Keywords
Lip-readings; Visual speech recognition; Audiovisual dataset; CNN; Vision transformer; INTEGRATION;
DOI
10.1007/s11042-024-18237-5
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Individuals with hearing impairments often rely on non-verbal communication, including facial expressions and gestures. Systems for Visual Speech Recognition (VSR) face challenges due to insufficient datasets and the complexity of extracting nuanced lip movements. In response, we propose a two-fold framework, BlidAVS10. First, we create a robust Arabic audio-visual dataset comprising 1,383 videos. Second, we introduce an innovative approach to Arabic Audio-Visual Speech Recognition that leverages BlidAVS10 for the development of various VSR systems. BlidAVS10 includes four key services: (1) the creation of a comprehensive dataset through video generation, (2) the detection, tracking, and extraction of the mouth region within each video frame, (3) the selection and customization of VSR models by developers, and (4) the building, training, and evaluation of our Deep Learning (DL) models, featuring a multi-layer Convolutional Neural Network (CNN) model and a Vision Transformer (ViT). Our extensive experiments on BlidAVS10 demonstrate the effectiveness and reliability of our recognition techniques under varying environmental conditions. The dataset and DL-based VSR systems achieved an accuracy of nearly 98%. This work introduces BlidAVS10, a groundbreaking audio-visual database, and offers a versatile framework with potential applications in assisting the hard of hearing, securing access through lipreading, enabling soundless communication with machines, and supporting the medical field in understanding the needs of laryngeal cancer patients.
Pages: 69989-70023
Page count: 35
Related papers
50 in total
  • [41] Diacritics Effect on Arabic Speech Recognition
    Sa’ed Abed
    Mohammad Alshayeji
    Sari Sultan
    Arabian Journal for Science and Engineering, 2019, 44 : 9043 - 9056
  • [42] A Comparative Study of Arabic Speech Recognition
    Ali, Onsy Abdel Alim
    Moselhy, Mohamed M.
    Bzeih, Aya
    2012 16TH IEEE MEDITERRANEAN ELECTROTECHNICAL CONFERENCE (MELECON), 2012, : 884 - 887
  • [43] An Investigation in Speech Recognition for Colloquial Arabic
    Al-Shareef, Sarah
    Hain, Thomas
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2880 - 2883
  • [44] Arabic speech synthesis and diacritic recognition
    Rebai, Ilyes
    BenAyed, Yassine
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2016, 19 (03) : 485 - 494
  • [45] Diacritics Effect on Arabic Speech Recognition
    Abed, Sa'ed
    Alshayeji, Mohammad
    Sultan, Sari
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2019, 44 (11) : 9043 - 9056
  • [46] Syntactic Features for Arabic Speech Recognition
    Kuo, Hong-Kwang Jeff
    Mangu, Lidia
    Emami, Ahmad
    Zitouni, Imed
    Lee, Young-Suk
    2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), 2009, : 327 - 332
  • [47] Arabic Automatic Speech Recognition Enhancement
    Ahmed, Basem H. A.
    Ghabayen, Ayman S.
    2017 PALESTINIAN INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGY (PICICT), 2017, : 98 - 102
  • [48] Arabic Speech Act Recognition Techniques
    Sherkawi, Lina
    Ghneim, Nada
    Al Dakkak, Oumayma
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2018, 17 (03)
  • [49] Evolutionary structure of hidden Markov models for audio-visual Arabic speech recognition
    Makhlouf, Amina
    Lazli, Lilia
    Bensaker, Bachir
    INTERNATIONAL JOURNAL OF SIGNAL AND IMAGING SYSTEMS ENGINEERING, 2016, 9 (01) : 55 - 66
  • [50] Speech Emotion Recognition Using CNN
    Huang, Zhengwei
    Dong, Ming
    Mao, Qirong
    Zhan, Yongzhao
    PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014, : 801 - 804