Large-Scale Visual Speech Recognition

Cited by: 30

Authors:

Shillingford, Brendan [1]

Assael, Yannis [1]

Hoffman, Matthew W. [1]

Paine, Thomas [1]

Hughes, Cian [1]

Prabhu, Utsav [2]

Liao, Hank [2]

Sak, Hasim [2]

Rao, Kanishka [2]

Bennett, Lorrayne [1]

Mulville, Marie [1]

Denil, Misha [1]

Coppin, Ben [1]

Laurie, Ben [1]

Senior, Andrew [1]

de Freitas, Nando [1]

Affiliations:

[1] DeepMind, London, England

[2] Google, Mountain View, CA 94043 USA

Source:

INTERSPEECH 2019 | 2019

Keywords:

visual speech recognition; lipreading

DOI:

10.21437/Interspeech.2019-1669

Chinese Library Classification (CLC):

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

This work presents a scalable solution to continuous visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of transcriptions and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a phoneme-to-word speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on previous lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.
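The results above are all reported as word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), the following Python function may help when reading the figures. It is not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is plain whitespace splitting, which
    is an assumption and not taken from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors while the denominator is the reference length, WER computed this way can exceed 100% when the hypothesis contains many extra words.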

Pages: 4135-4139

Page count: 5
