Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer

Cited by: 0
Authors
Wang, Huijuan [1]
Cui, Boyan [1]
Yuan, Quanbo [1]
Pu, Gangqiang [1]
Liu, Xueli [1]
Zhu, Jie [1]
Affiliations
[1] North China Inst Aerosp Engn, Sch Comp, Langfang 065000, Peoples R China
Source
VISUAL COMPUTER | 2025 / Volume 41 / Issue 03
Keywords
Lip-reading; 3D convolution; Visual transformer; Spatial-temporal features; Model light weighting;
DOI
10.1007/s00371-024-03515-y
Chinese Library Classification (CLC) number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Lip-reading has attracted growing attention in recent years and has broad application prospects in areas such as human-computer interaction, surveillance and security, and audiovisual speech recognition. However, progress has been slow because of the difficulty of handling both the fine spatial features of small images in continuous video frames and the temporal features between frames. To address the challenges of extracting visual spatial features, extracting temporal features, and making the model lightweight, we propose Mini-3DCvT, a high-precision, highly robust, and lightweight lip-reading method. Mini-3DCvT combines visual transformers and 3D convolution to extract spatiotemporal features from continuous images, exploiting the properties of convolution and transformers to extract both local and global features effectively. It applies weight transformation and weight distillation in the convolution and transformer structures for model compression, and then sends the extracted features to a bidirectional gated recurrent unit for sequence modeling. Experimental results on the large-scale public lip-reading datasets LRW and LRW-1000 show that the method achieves recognition accuracies of 88.3% and 57.1%, respectively, while effectively reducing the model's computation and number of parameters, improving the overall performance of the lip-reading model.
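As a rough illustration of the 3D-convolution front end mentioned in the abstract, the sketch below computes the output shape and parameter count of a single 3D convolution layer using the standard convolution arithmetic. The clip size (29 frames of 88x88 pixels), kernel, stride, padding, and channel counts are assumed values chosen for illustration; the abstract does not state the configuration used by Mini-3DCvT.

```python
def conv3d_output_shape(in_shape, kernel, stride, padding):
    """Return the (T, H, W) size after a 3D convolution without dilation.

    Each dimension follows: out = floor((in + 2 * pad - kernel) / stride) + 1
    """
    return tuple(
        (size + 2 * p - k) // s + 1
        for size, k, s, p in zip(in_shape, kernel, stride, padding)
    )


def conv3d_param_count(in_channels, out_channels, kernel, bias=True):
    """Count the learnable parameters of one 3D convolution layer."""
    kt, kh, kw = kernel
    weights = in_channels * out_channels * kt * kh * kw
    return weights + (out_channels if bias else 0)


# Hypothetical settings, NOT taken from the paper:
clip = (29, 88, 88)      # frames, height, width of a grayscale mouth crop
kernel = (5, 7, 7)       # temporal x spatial kernel
stride = (1, 2, 2)       # keep every frame, halve spatial resolution
padding = (2, 3, 3)

out_shape = conv3d_output_shape(clip, kernel, stride, padding)
params = conv3d_param_count(in_channels=1, out_channels=64, kernel=kernel)

print("output (T, H, W):", out_shape)
print("parameters:", params)
```

With these assumed settings the temporal length is preserved while the spatial resolution is halved, which is one common way to keep per-frame features available for a downstream sequence model such as a bidirectional GRU.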
Pages: 1957-1969
Number of pages: 13