Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer

Citations: 0
Authors
Wang, Huijuan [1]
Cui, Boyan [1]
Yuan, Quanbo [1]
Pu, Gangqiang [1]
Liu, Xueli [1]
Zhu, Jie [1]
Affiliation
[1] North China Inst Aerosp Engn, Sch Comp, Langfang 065000, Peoples R China
Source
VISUAL COMPUTER | 2025 / Vol. 41 / Issue 03
Keywords
Lip-reading; 3D convolution; Visual transformer; Spatial-temporal features; Model light weighting;
DOI
10.1007/s00371-024-03515-y
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202; 0835;
Abstract
Lip-reading has attracted growing attention in recent years and has broad application prospects in areas such as human-computer interaction, surveillance and security, and audiovisual speech recognition. However, progress has been slow because of the difficulty of handling both the fine spatial features of small images in continuous video frames and the temporal features between frames. To address the challenges of extracting visual spatial features, extracting temporal features, and keeping the model lightweight, we propose Mini-3DCvT, a high-precision, highly robust, and lightweight lip-reading method. It combines visual transformers and 3D convolution to extract spatiotemporal features from continuous images, exploiting the properties of convolution and transformers to extract both local and global features effectively. Weight transformation and weight distillation are applied in the convolution and transformer structures for model compression, and the extracted features are then sent to a bidirectional gated recurrent unit for sequence modeling. Experimental results on the large-scale public lip-reading datasets LRW and LRW-1000 show that the method achieves recognition accuracies of 88.3% and 57.1%, respectively, while effectively reducing the model's computation and number of parameters, improving the overall performance of the lip-reading model.
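The pipeline described in the abstract (a 3D convolution stage feeding a sequence model built on a bidirectional gated recurrent unit) can be illustrated with some simple shape and parameter bookkeeping. The following is only a sketch under stated assumptions, not the authors' architecture: the clip size (29 frames of 88 x 88 grayscale), the kernel, stride, and padding, the channel widths, and the helper names conv3d_out_shape, conv3d_params, and bigru_params are all hypothetical values chosen for illustration. The formulas themselves are the standard ones for a 3D convolution and for a single-layer GRU with one bias vector per gate.

```python
def conv3d_out_shape(shape, kernel, stride, padding):
    """Output (T, H, W) of a 3D convolution using the standard size formula."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )


def conv3d_params(c_in, c_out, kernel, bias=True):
    """Weights (plus optional biases) of one 3D convolution layer."""
    kt, kh, kw = kernel
    return c_out * c_in * kt * kh * kw + (c_out if bias else 0)


def bigru_params(input_size, hidden_size):
    """Single-layer bidirectional GRU, one bias vector per gate.

    Each direction has three gates (reset, update, candidate), and each
    gate has an input matrix, a recurrent matrix, and a bias vector.
    """
    per_direction = 3 * (
        input_size * hidden_size + hidden_size * hidden_size + hidden_size
    )
    return 2 * per_direction


# Assumed, illustrative sizes (NOT taken from the paper).
frames, height, width = 29, 88, 88
kernel, stride, padding = (5, 7, 7), (1, 2, 2), (2, 3, 3)
c_in, c_out = 1, 64

t, h, w = conv3d_out_shape((frames, height, width), kernel, stride, padding)
front_end_params = conv3d_params(c_in, c_out, kernel)
gru_params = bigru_params(input_size=256, hidden_size=256)

print("3D conv output (T, H, W):", (t, h, w))
print("3D conv parameters:", front_end_params)
print("BiGRU parameters:", gru_params)
```

With these assumed sizes, the temporal length is preserved (stride 1 along time) while the spatial resolution is halved, so the recurrent stage still receives one feature vector per frame. Counting parameters this way is one simple means of comparing how much a compression step shrinks a model.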
Pages: 1957-1969
Page count: 13
Related papers
50 in total
  • [31] The Lightweight Fracture Segmentation Algorithm for Logging Images Based on Fully 3D Attention Mechanism and Deformable Convolution
    Yang, Qishun
    Zhang, Liyan
    Xi, Zihan
    Qian, Yu
    Li, Ang
    APPLIED SCIENCES-BASEL, 2024, 14 (22):
  • [32] Lightweight 3D convolution model for failure prediction of coal under uniaxial compression based on acoustic emission
    Zhao Y.
    Qiao H.
    Xie R.
    Guo J.
    Yanshilixue Yu Gongcheng Xuebao/Chinese Journal of Rock Mechanics and Engineering, 2022, 41 (08): 1567 - 1580
  • [33] A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction
    Mathai, Mareeta
    Liu, Ying
    Ling, Nam
    IEEE ACCESS, 2024, 12 : 39589 - 39602
  • [34] SCGFormer: Semantic Chebyshev Graph Convolution Transformer for 3D Human Pose Estimation
    Liang, Jiayao
    Yin, Mengxiao
    APPLIED SCIENCES-BASEL, 2024, 14 (04):
  • [35] Weighted Sparse Convolution and Transformer Feature Aggregation Networks for 3D Dental Segmentation
    Ahn, Jung Su
    Cho, Young-Rae
    IEEE ACCESS, 2024, 12 : 135172 - 135184
  • [36] 3D visual component based player for script based 3D CG animation
    Yoshiyama, T
    Okada, Y
    Niijima, K
    Proceedings of the 2005 International Conference on Active Media Technology (AMT 2005), 2005, : 499 - 502
  • [37] Copula-based transformer in EEG to assess visual discomfort induced by stereoscopic 3D
    Zheng, Yawen
    Zhao, Xiaojie
    Yao, Li
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2022, 77
  • [38] Research on Data Lightweight Method for Real Scene 3D Based on Interactive Markup
    Lu, Xin
    2024 5TH INTERNATIONAL CONFERENCE ON GEOLOGY, MAPPING AND REMOTE SENSING, ICGMRS 2024, 2024, : 256 - 259
  • [39] A Novel 3D Magnetic Resonance Imaging Registration Framework Based on the Swin-Transformer UNet+ Model with 3D Dynamic Snake Convolution Scheme
    Han, Yaolong
    Wang, Lei
    Huang, Zizhen
    Zhang, Yukun
    Zheng, Xiao
    JOURNAL OF IMAGING, 2025, 11 (02)
  • [40] A review of video action recognition based on 3D convolution
    Huang, Xiankai
    Cai, Zhibin
    COMPUTERS & ELECTRICAL ENGINEERING, 2023, 108