Large-Scale Visual Speech Recognition

Cited by: 30
Authors
Shillingford, Brendan [1]
Assael, Yannis [1]
Hoffman, Matthew W. [1]
Paine, Thomas [1]
Hughes, Cian [1]
Prabhu, Utsav [2]
Liao, Hank [2]
Sak, Hasim [2]
Rao, Kanishka [2]
Bennett, Lorrayne [1]
Mulville, Marie [1]
Denil, Misha [1]
Coppin, Ben [1]
Laurie, Ben [1]
Senior, Andrew [1]
de Freitas, Nando [1]
Affiliations
[1] DeepMind, London, England
[2] Google, Mountain View, CA 94043 USA
Source
Interspeech 2019
Keywords
visual speech recognition; lipreading
DOI
10.21437/Interspeech.2019-1669
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
This work presents a scalable solution to continuous visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of 3,886 hours of video clips of speaking faces paired with their transcriptions. In tandem, we designed and trained an integrated lipreading system with three components: a video processing pipeline that maps raw video to stable videos of lips and to sequences of phonemes; a scalable deep neural network that maps the lip videos to sequences of phoneme distributions; and a phoneme-to-word speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset, depending on the additional contextual information available to them. Our approach also significantly improves on previous lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which achieve only 89.8% and 76.8% WER, respectively.
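All results in the abstract are reported as word error rate (WER). For readers unfamiliar with the metric, here is a minimal Python sketch of its standard definition: the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. This is not the paper's evaluation code. The function names are hypothetical, and the sketch assumes whitespace tokenization with no text normalization (casing, punctuation), which a real evaluation would typically apply first.

```python
from typing import Iterable, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions and insertions
    needed to turn `ref` into `hyp` (Levenshtein distance over words)."""
    # prev[j] = distance between the first i-1 reference words and the
    # first j hypothesis words (a rolling single-row DP table).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # substitute (or match when cost == 0)
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for a single utterance: edits / number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_word_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total edits divided by total reference words,
    rather than an average of per-utterance WERs."""
    edits = 0
    words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        edits += word_edit_distance(ref, hypothesis.split())
        words += len(ref)
    if words == 0:
        raise ValueError("corpus must contain at least one reference word")
    return edits / words


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are unbounded, WER can exceed 100%, which is why figures close to 90% indicate transcripts that are mostly wrong. Held-out WER figures such as 40.9% are commonly aggregated at the corpus level, as in `corpus_word_error_rate`; the exact aggregation used for the numbers above is not stated in this record.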
Pages: 4135-4139
Page count: 5