Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer

Authors
Wang, Huijuan [1]
Cui, Boyan [1]
Yuan, Quanbo [1]
Pu, Gangqiang [1]
Liu, Xueli [1]
Zhu, Jie [1]
Affiliations
[1] North China Inst Aerosp Engn, Sch Comp, Langfang 065000, Peoples R China
Source
VISUAL COMPUTER | 2025 / Vol. 41 / Issue 03
Keywords
Lip-reading; 3D convolution; Visual transformer; Spatial-temporal features; Model light weighting;
DOI
10.1007/s00371-024-03515-y
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Lip-reading has attracted growing attention in recent years and has broad application prospects in areas such as human-computer interaction, surveillance and security, and audiovisual speech recognition. However, progress has been slow because of the difficulty of handling both the fine spatial features of small images in continuous video frames and the temporal features between frames. In this paper, to address the challenges of spatial feature extraction, temporal feature extraction, and model lightweighting, we propose Mini-3DCvT, a high-precision, highly robust, and lightweight lip-reading method. Mini-3DCvT combines a visual transformer with 3D convolution to extract spatiotemporal features from continuous images, exploiting the properties of convolution and transformers to extract local and global features effectively. It applies weight transformation and weight distillation in the convolution and transformer structures for model compression, and then sends the extracted features to a bidirectional gated recurrent unit for sequence modeling. Experimental results on the large-scale public lip-reading datasets LRW and LRW-1000 show that the method achieves recognition accuracies of 88.3% and 57.1%, respectively, while effectively reducing the model's computation and parameter count, improving the overall performance of the lip-reading model.
Pages: 1957 - 1969
Number of pages: 13