Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer

Cited by: 0
Authors
Wang, Huijuan [1 ]
Cui, Boyan [1 ]
Yuan, Quanbo [1 ]
Pu, Gangqiang [1 ]
Liu, Xueli [1 ]
Zhu, Jie [1 ]
Affiliation
[1] North China Inst Aerosp Engn, Sch Comp, Langfang 065000, Peoples R China
Source
VISUAL COMPUTER | 2025 / Volume 41 / Issue 03
Keywords
Lip-reading; 3D convolution; Visual transformer; Spatial-temporal features; Model light weighting;
DOI
10.1007/s00371-024-03515-y
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Lip-reading has attracted growing attention in recent years and has broad application prospects in areas such as human-computer interaction, surveillance and security, and audiovisual speech recognition. However, progress has been slow because of the difficulty of handling both the fine spatial features of small images in continuous video frames and the temporal features between those frames. To address the challenges of spatial feature extraction, temporal feature extraction, and model lightweighting, this paper proposes Mini-3DCvT, a high-precision, highly robust, and lightweight lip-reading method. Mini-3DCvT combines a visual transformer with 3D convolution to extract spatiotemporal features from continuous images, exploiting the properties of convolution and transformers to capture both local and global features. Weight transformation and weight distillation are applied in the convolution and transformer structures for model compression, and the extracted features are then fed to a bidirectional gated recurrent unit for sequence modeling. Experimental results on the large-scale public lip-reading datasets LRW and LRW-1000 show that the method achieves recognition accuracies of 88.3% and 57.1%, respectively, while effectively reducing model computation and parameter count, improving the overall performance of the lip-reading model.
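The pipeline described in the abstract (3D convolution, then a transformer, then a bidirectional GRU) can be followed at the level of tensor shapes. The sketch below is only an illustration under stated assumptions: the clip size (29 frames of 88x88 pixels), the kernel, stride and padding values, and the hidden size are hypothetical and are not taken from the paper. It uses the standard convolution output-size formula, floor((n + 2p - k) / s) + 1, applied per axis.

```python
def conv3d_out_shape(in_shape, kernel, stride, padding):
    """Output (T, H, W) of a 3D convolution: floor((n + 2p - k) / s) + 1 per axis."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(in_shape, kernel, stride, padding)
    )


def bigru_out_shape(seq_len, hidden_size):
    """A bidirectional GRU concatenates forward and backward states per time step."""
    return (seq_len, 2 * hidden_size)


# Hypothetical input clip: 29 grayscale frames of 88x88 pixels (assumed, not from the paper).
clip = (29, 88, 88)

# Hypothetical front-end settings: temporal stride 1 keeps every frame,
# spatial stride 2 halves height and width.
front = conv3d_out_shape(clip, kernel=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))

# One feature vector per frame is passed on for sequence modeling;
# the hidden size of 256 is likewise an assumption.
sequence = bigru_out_shape(front[0], hidden_size=256)

print(front)     # (29, 44, 44)
print(sequence)  # (29, 512)
```

With a temporal stride of 1 the frame count is preserved, so the recurrent unit sees one step per input frame, which is why the sequence length in the last shape matches the clip length.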
Pages: 1957-1969
Page count: 13
Related papers
50 in total
  • [1] A Lip Reading Method Based on 3D Convolutional Vision Transformer
    Wang, Huijuan
    Pu, Gangqiang
    Chen, Tingyu
    IEEE ACCESS, 2022, 10 : 77205 - 77212
  • [2] Fast 3D NIR systems for facial measurement and lip-reading
    Brahm, Anika
    Ramm, Roland
    Heist, Stefan
    Rulff, Christian
    Kuhmstedt, Peter
    Notni, Gunther
    DIMENSIONAL OPTICAL METROLOGY AND INSPECTION FOR PRACTICAL APPLICATIONS VI, 2017, 10220
  • [3] HMM-based Lip Reading with Stingy Residual 3D Convolution
    Zeng, Qifeng
    Du, Jun
    Wang, Zirui
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 1438 - 1443
  • [4] A PCA based visual DCT feature extraction method for lip-reading
    Hong, Xiaopeng
    Yao, Hongxun
    Wan, Yuqi
    Chen, Rong
    IIH-MSP: 2006 INTERNATIONAL CONFERENCE ON INTELLIGENT INFORMATION HIDING AND MULTIMEDIA SIGNAL PROCESSING, PROCEEDINGS, 2006, : 321 - +
  • [5] Light3DHS: A lightweight 3D hippocampus segmentation method using multiscale convolution attention and vision transformer
    Xiao, Zhiyong
    Zhang, Yuhong
    Deng, Zhaohong
    Liu, Fei
    NEUROIMAGE, 2024, 292
  • [6] Lip Reading Using Deformable 3D Convolution and Channel-Temporal Attention
    Peng, Chen
    Li, Jun
    Chai, Jie
    Zhao, Zhongqiu
    Zhang, Housen
    Tian, Weidong
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT IV, 2022, 13532 : 707 - 718
  • [7] 3D Medical Axial Transformer: A Lightweight Transformer Model for 3D Brain Tumor Segmentation
    Liu, Cheng
    Kiryu, Hisanori
    MEDICAL IMAGING WITH DEEP LEARNING, VOL 227, 2023, 227 : 799 - 813
  • [8] Sign language recognition based on lightweight 3D CNNs and Transformer
    Lu F.
    Han X.
    Cheng X.
    Tian G.
    Huazhong Keji Daxue Xuebao (Ziran Kexue Ban)/Journal of Huazhong University of Science and Technology (Natural Science Edition), 2023, 51 (05): : 13 - 18
  • [9] A Genetic Algorithm-Based 3D Feature Selection for Lip Reading
    Morade, Sunil Sudam
    Patnaik, Suprava
    2015 INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING (ICPC), 2015,
  • [10] Tibetan lip reading based on D3D
    Gan, Zhenye
    Yu, Xinke
    Zeng, Hao
    Zhao, Tianqin
    2021 2ND INTERNATIONAL CONFERENCE ON BIG DATA & ARTIFICIAL INTELLIGENCE & SOFTWARE ENGINEERING (ICBASE 2021), 2021, : 439 - 442