Large-Scale Visual Speech Recognition

Cited by: 30
Authors
Shillingford, Brendan [1]
Assael, Yannis [1]
Hoffman, Matthew W. [1]
Paine, Thomas [1]
Hughes, Cian [1]
Prabhu, Utsav [2]
Liao, Hank [2]
Sak, Hasim [2]
Rao, Kanishka [2]
Bennett, Lorrayne [1]
Mulville, Marie [1]
Denil, Misha [1]
Coppin, Ben [1]
Laurie, Ben [1]
Senior, Andrew [1]
de Freitas, Nando [1]
Affiliations
[1] DeepMind, London, England
[2] Google, Mountain View, CA 94043 USA
Source
Interspeech 2019
Keywords
visual speech recognition; lipreading
DOI
10.21437/Interspeech.2019-1669
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This work presents a scalable solution to continuous visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of transcriptions and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a phoneme-to-word speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on previous lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.
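The word error rate (WER) figures quoted in the abstract can be read against the standard definition: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal illustration of that definition only. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example, not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); assumes whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER under this definition is not capped at 100%: a hypothesis much longer than the reference can score above 1.0.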
Pages: 4135-4139
Number of pages: 5
Related papers
50 entries in total
  • [1] Large-Scale Visual Font Recognition
    Chen, Guang
    Yang, Jianchao
    Jin, Hailin
    Brandt, Jonathan
    Shechtman, Eli
    Agarwala, Aseem
    Han, Tony X.
    [J]. 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014, : 3598 - 3605
  • [2] Exploring Transformers for Large-Scale Speech Recognition
    Lu, Liang
    Liu, Changliang
    Li, Jinyu
    Gong, Yifan
    [J]. INTERSPEECH 2020, 2020, : 5041 - 5045
  • [3] THE SPEECHTRANSFORMER FOR LARGE-SCALE MANDARIN CHINESE SPEECH RECOGNITION
    Zhao, Yuanyuan
    Li, Jie
    Wang, Xiaorui
    Li, Yan
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7095 - 7099
  • [4] Sparse Output Coding for Large-Scale Visual Recognition
    Zhao, Bin
    Xing, Eric P.
    [J]. 2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2013, : 3350 - 3357
  • [5] Embedding Visual Hierarchy With Deep Networks for Large-Scale Visual Recognition
    Zhao, Tianyi
    Zhang, Baopeng
    He, Ming
    Zhang, Wei
    Zhou, Ning
    Yu, Jun
    Fan, Jianping
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2018, 27 (10) : 4740 - 4755
  • [6] AN INVESTIGATION OF MONOTONIC TRANSDUCERS FOR LARGE-SCALE AUTOMATIC SPEECH RECOGNITION
    Moritz, Niko
    Seide, Frank
    Le, Duc
    Mahadeokar, Jay
    Fuegen, Christian
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 324 - 330
  • [7] Large-Scale Random Forest Language Models for Speech Recognition
    Su, Yi
    Jelinek, Frederick
    Khudanpur, Sanjeev
    [J]. INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 945 - 948
  • [8] Automatic Speech Recognition of Vietnamese for a New Large-Scale Corpus
    Tran, Linh Thi Thuc
    Kim, Han-Gyu
    La, Hoang Minh
    Pham, Su Van
    [J]. ELECTRONICS, 2024, 13 (05)
  • [9] LSSED: A LARGE-SCALE DATASET AND BENCHMARK FOR SPEECH EMOTION RECOGNITION
    Fan, Weiquan
    Xu, Xiangmin
    Xing, Xiaofen
    Chen, Weidong
    Huang, Dongyan
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 641 - 645
  • [10] Discriminative Learning of Relaxed Hierarchy for Large-scale Visual Recognition
    Gao, Tianshi
    Koller, Daphne
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2011, : 2072 - 2079