Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer

Cited by: 0
Authors
Wang, Huijuan [1]
Cui, Boyan [1]
Yuan, Quanbo [1]
Pu, Gangqiang [1]
Liu, Xueli [1]
Zhu, Jie [1]
Affiliations
[1] North China Inst Aerosp Engn, Sch Comp, Langfang 065000, Peoples R China
Source
VISUAL COMPUTER | 2025 / Volume 41 / Issue 03
Keywords
Lip-reading; 3D convolution; Visual transformer; Spatial-temporal features; Model light weighting
DOI
10.1007/s00371-024-03515-y
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Lip-reading has attracted growing attention in recent years and has broad application prospects in areas such as human-computer interaction, surveillance and security, and audiovisual speech recognition. However, progress has been slow because of the difficulty of handling both the fine spatial features of small images in continuous video frames and the temporal features between frames. In this paper, to address the challenges of spatial feature extraction, temporal feature extraction and model lightweighting, we propose Mini-3DCvT, a high-precision, highly robust and lightweight lip-reading method. It combines visual transformers and 3D convolution to extract spatiotemporal features from continuous images, exploiting the properties of convolution and transformers to extract both local and global features. Weight transformation and weight distillation are applied in the convolution and transformer structures for model compression, and the extracted features are then sent to a bidirectional gated recurrent unit for sequence modeling. Experimental results on the large-scale public lip-reading datasets LRW and LRW-1000 show that the method achieves recognition accuracies of 88.3% and 57.1%, respectively, while effectively reducing the model's computation and number of parameters, improving the overall performance of the lip-reading model.
Pages: 1957 - 1969
Number of pages: 13
Related papers
50 in total
  • [21] Swin transformer with multiscale 3D atrous convolution for hyperspectral image classification
    Farooque, Ghulam
    Liu, Qichao
    Sargano, Allah Bux
    Xiao, Liang
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 126
  • [22] A 3D Medical Image Segmentation Framework Fusing Convolution and Transformer Features
    Zhu, Fazhan
    Lv, Jiaxing
    Lu, Kun
    Wang, Wenyan
    Cong, Hongshou
    Zhang, Jun
    Chen, Peng
    Zhao, Yuan
    Wu, Ziheng
    INTELLIGENT COMPUTING THEORIES AND APPLICATION (ICIC 2022), PT I, 2022, 13393 : 772 - 786
  • [23] UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
    Chen, Dave Zhenyu
    Hu, Ronghang
    Chen, Xinlei
    Niessner, Matthias
    Chang, Angel X.
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 18063 - 18073
  • [24] Development of the "mini 3D soil profile" - A visual method derived from the "profil cultural"
    Tomis, Vincent
    Duparque, Annie
    Boizard, Hubert
    SOIL & TILLAGE RESEARCH, 2019, 194
  • [25] Automatic Evaluation Method for Functional Movement Screening Based on Multi-Scale Lightweight 3D Convolution and an Encoder-Decoder
    Lin, Xiuchun
    Liu, Yichao
    Feng, Chen
    Chen, Zhide
    Yang, Xu
    Cui, Hui
    ELECTRONICS, 2024, 13 (10)
  • [26] Fusion information enhanced method based on transformer for 3D object detection
    Jin Y.
    Tao C.
    Yi Qi Yi Biao Xue Bao/Chinese Journal of Scientific Instrument, 2023, 44 (12) : 297 - 306
  • [27] 3DVT: Hyperspectral Image Classification Using 3D Dilated Convolution and Mean Transformer
    Su, Xinling
    Shao, Jingbo
    PHOTONICS, 2025, 12 (02)
  • [28] Transformer-Based Global PointPillars 3D Object Detection Method
    Zhang, Lin
    Meng, Hua
    Yan, Yunbing
    Xu, Xiaowei
    ELECTRONICS, 2023, 12 (14)
  • [29] Lightweight brain MR image super-resolution using 3D convolution
    Kim, Young Beom
    Van Le, The
    Lee, Jin Young
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (03) : 8785 - 8795
  • [30] Lightweight brain MR image super-resolution using 3D convolution
    Young Beom Kim
    The Van Le
    Jin Young Lee
    Multimedia Tools and Applications, 2024, 83 : 8785 - 8795