Large-Scale Visual Speech Recognition

Cited by: 30
Authors
Shillingford, Brendan [1]
Assael, Yannis [1]
Hoffman, Matthew W. [1]
Paine, Thomas [1]
Hughes, Cian [1]
Prabhu, Utsav [2]
Liao, Hank [2]
Sak, Hasim [2]
Rao, Kanishka [2]
Bennett, Lorrayne [1]
Mulville, Marie [1]
Denil, Misha [1]
Coppin, Ben [1]
Laurie, Ben [1]
Senior, Andrew [1]
de Freitas, Nando [1]
Affiliations
[1] DeepMind, London, England
[2] Google, Mountain View, CA 94043 USA
Source
Keywords
visual speech recognition; lipreading
DOI
10.21437/Interspeech.2019-1669
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This work presents a scalable solution to continuous visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of transcriptions and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a phoneme-to-word speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on previous lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.
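Note on the metric: the abstract reports results as word error rate (WER). The sketch below shows how WER is conventionally computed, as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. It is an illustration only: the function name `word_error_rate` and the whitespace tokenization are assumptions, and the record does not describe the scoring tool the authors actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance / number of reference words.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which is an assumption rather than the paper's procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, so the figures in the abstract are best read as error counts relative to reference length rather than as a bounded percentage of words missed.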
Pages: 4135-4139
Number of pages: 5