LCANet: End-to-End Lipreading with Cascaded Attention-CTC

Cited by: 68
Authors
Xu, Kai [1 ]
Li, Dawei [2 ]
Cassimatis, Nick [2 ]
Wang, Xiaolong [2 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85281 USA
[2] Samsung Res Amer, Mountain View, CA 94043 USA
Keywords
Lipreading; ASR; attention mechanism; CTC; cascaded attention-CTC; deep neural network; 3D CNN; highway network; Bi-GRU;
DOI
10.1109/FG.2018.00088
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machine lipreading is a special type of automatic speech recognition (ASR) that transcribes human speech by visually interpreting the movement of related face regions, i.e., the lips, face, and tongue. Recently, deep neural network based lipreading methods have shown great potential and have exceeded the accuracy of experienced human lipreaders on some benchmark datasets. However, lipreading is still far from solved, and existing methods tend to have high error rates on in-the-wild data. In this paper, we propose LCANet, an end-to-end deep neural network based lipreading system. LCANet encodes input video frames using a stacked 3D convolutional neural network (CNN), a highway network, and a bidirectional GRU network. The encoder effectively captures both short-term and long-term spatio-temporal information. More importantly, LCANet incorporates a cascaded attention-CTC decoder to generate output texts. By cascading attention with CTC, it partially eliminates the defect of CTC's conditional independence assumption within the hidden neural layers, which yields a notable performance improvement as well as faster convergence. Experimental results show that the proposed system achieves a 1.3% CER and 3.0% WER on the GRID corpus, a 12.3% improvement over state-of-the-art methods.
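To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of the pipeline: a stacked 3D CNN, highway layers, and a Bi-GRU encoder, with an attention layer cascaded in front of the CTC projection. All hyperparameters (layer sizes, kernel shapes, a 28-character vocabulary) and the multi-head form of the attention are illustrative assumptions, not the paper's exact configuration or the authors' implementation.

# Minimal sketch of an LCANet-style pipeline; hyperparameters and the
# attention form are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """Highway layer: y = t * H(x) + (1 - t) * x, with learned gate t."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))        # transform gate
        h = torch.relu(self.transform(x))      # candidate activation
        return t * h + (1.0 - t) * x           # gated mix of new and old

class LCANetSketch(nn.Module):
    def __init__(self, feat_dim=256, hidden=256, vocab_size=28):
        super().__init__()
        # Stacked 3D convolutions capture short-term spatio-temporal lip motion.
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.proj = nn.LazyLinear(feat_dim)    # flatten spatial dims per frame
        self.highway = nn.Sequential(Highway(feat_dim), Highway(feat_dim))
        # Bi-GRU models long-term temporal context in both directions.
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Attention cascaded in front of the CTC projection conditions each
        # frame's output on context, softening CTC's independence assumption.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)  # vocab includes CTC blank

    def forward(self, video):                  # video: (B, 3, T, H, W)
        f = self.conv(video)                   # (B, C, T, H', W')
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        f = self.highway(self.proj(f))         # per-frame features (B, T, D)
        enc, _ = self.gru(f)                   # (B, T, 2 * hidden)
        ctx, _ = self.attn(enc, enc, enc)      # context-aware frame states
        return self.fc(ctx + enc).log_softmax(-1)  # (B, T, vocab_size)

# Training step against PyTorch's built-in CTC loss (blank index 0 assumed):
model = LCANetSketch()
video = torch.randn(2, 3, 75, 50, 100)         # 75 frames, GRID-like mouth crops
log_probs = model(video).transpose(0, 1)       # F.ctc_loss expects (T, B, V)
targets = torch.randint(1, 28, (2, 20))        # dummy character labels (no blank)
in_lens = torch.full((2,), 75, dtype=torch.long)
tgt_lens = torch.full((2,), 20, dtype=torch.long)
loss = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)

The design point is the cascade: attention refines the hidden states before the CTC projection, so each frame's output distribution is conditioned on its context, rather than combining attention and CTC only at the loss or decoding stage as in the hybrid CTC/attention systems of related papers [2]-[4] below.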
Pages: 548 - 555
Number of pages: 8
Related Papers
(50 records in total)
  • [1] End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC
    Jeon, Sanghun
    Kim, Mun Sang
    [J]. SENSORS, 2022, 22 (09)
  • [2] Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
    Watanabe, Shinji
    Hori, Takaaki
    Kim, Suyoun
    Hershey, John R.
    Hayashi, Tomoki
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1240 - 1253
  • [3] Joint CTC/attention decoding for end-to-end speech recognition
    Hori, Takaaki
    Watanabe, Shinji
    Hershey, John R.
    [J]. PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 518 - 529
  • [4] Online Hybrid CTC/Attention Architecture for End-to-end Speech Recognition
    Miao, Haoran
    Cheng, Gaofeng
    Zhang, Pengyuan
    Li, Ta
    Yan, Yonghong
    [J]. INTERSPEECH 2019, 2019, : 2623 - 2627
  • [5] End-to-end recognition of streaming Japanese speech using CTC and local attention
    Chen, Jiahao
    Nishimura, Ryota
    Kitaoka, Norihide
    [J]. APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2020, 9 (01)
  • [6] Hybrid CTC/Attention End-to-End Chinese Speech Recognition Enhanced by Conformer
    [J]. 2024, 59 (04) : 97 - 103
  • [7] Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture
    Miao, Haoran
    Cheng, Gaofeng
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 1452 - 1465
  • [8] Exploring Hybrid CTC/Attention End-to-End Speech Recognition with Gaussian Processes
    Kuerzinger, Ludwig
    Watzel, Tobias
    Li, Lujun
    Baumgartner, Robert
    Rigoll, Gerhard
    [J]. SPEECH AND COMPUTER, SPECOM 2019, 2019, 11658 : 258 - 269
  • [9] Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
    Wu, Long
    Li, Ta
    Wang, Li
    Yan, Yonghong
    [J]. APPLIED SCIENCES-BASEL, 2019, 9 (21)
  • [10] TRANSFORMER-BASED ONLINE CTC/ATTENTION END-TO-END SPEECH RECOGNITION ARCHITECTURE
    Miao, Haoran
    Cheng, Gaofeng
    Gao, Changfeng
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6084 - 6088