Deep Audio-Visual Speech Recognition

Cited by: 328
Authors
Afouras, Triantafyllos [1]
Chung, Joon Son [1]
Senior, Andrew [2]
Vinyals, Oriol [2]
Zisserman, Andrew [1,2]
Institutions
[1] Univ Oxford, Oxford OX1 2JD, England
[2] Google DeepMind, London N1C 4AG, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
Hidden Markov models; lips; speech recognition; visualization; videos; feeds; training; lip reading; audio-visual speech recognition; deep learning; networks
DOI
10.1109/TPAMI.2018.2889052
CLC classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural-language sentences in in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss and the other using a sequence-to-sequence loss, both built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.
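As an illustration of contribution (1), the sketch below contrasts the two losses compared in the paper, a CTC head versus a sequence-to-sequence decoder, both on top of a transformer encoder over lip-video features. This is a minimal PyTorch sketch rather than the authors' released code; the vocabulary size, feature dimension, sequence lengths, and layer counts are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): CTC vs. seq2seq losses on a
# transformer encoder. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = 40     # assumed character vocabulary, index 0 reserved as CTC blank
D_MODEL = 512  # assumed feature dimension
T_VIDEO = 75   # assumed number of lip-video frames
T_TEXT = 20    # assumed transcript length
BATCH = 2

# Shared encoder: per-frame features from an (assumed pre-extracted) visual front-end.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=6,
)
frames = torch.randn(BATCH, T_VIDEO, D_MODEL)       # stand-in for CNN lip features
memory = encoder(frames)                            # (batch, time, feature)
targets = torch.randint(1, VOCAB, (BATCH, T_TEXT))  # dummy character targets

# Variant 1: CTC head. One linear projection per frame; CTC marginalises
# over all monotonic alignments between frames and characters.
ctc_head = nn.Linear(D_MODEL, VOCAB)
log_probs = ctc_head(memory).log_softmax(-1).transpose(0, 1)  # (T, B, V)
ctc_loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((BATCH,), T_VIDEO),
    target_lengths=torch.full((BATCH,), T_TEXT),
)

# Variant 2: sequence-to-sequence head. A transformer decoder attends over
# the encoder memory and is trained with teacher forcing and per-character
# cross-entropy (targets would be shifted right behind an SOS token in practice).
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=6,
)
embed = nn.Embedding(VOCAB, D_MODEL)
out_proj = nn.Linear(D_MODEL, VOCAB)
causal_mask = nn.Transformer.generate_square_subsequent_mask(T_TEXT)
dec_out = decoder(embed(targets), memory, tgt_mask=causal_mask)
s2s_loss = nn.CrossEntropyLoss()(
    out_proj(dec_out).reshape(-1, VOCAB), targets.reshape(-1)
)

print(f"CTC loss: {ctc_loss.item():.3f}, seq2seq loss: {s2s_loss.item():.3f}")
```

The structural difference is the one the comparison probes: the CTC head scores each frame independently and marginalises over alignments, while the sequence-to-sequence decoder conditions each output character on previously emitted ones through causal self-attention.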
Pages: 8717 - 8727
Page count: 11
Related papers
50 records in total
  • [1] Deep Multimodal Learning for Audio-Visual Speech Recognition
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015: 2130 - 2134
  • [2] Audio-visual speech recognition using deep learning
    Noda, Kuniaki
    Yamaguchi, Yuki
    Nakadai, Kazuhiro
    Okuno, Hiroshi G.
    Ogata, Tetsuya
    [J]. APPLIED INTELLIGENCE, 2015, 42 (04) : 722 - 737
  • [3] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
    Hwang, Jung-Wook
    Park, Jeongkyun
    Park, Rae-Hong
    Park, Hyung-Min
    [J]. APPLIED ACOUSTICS, 2023, 211
  • [4] Scope for Deep Learning: A Study in Audio-Visual Speech Recognition
    Bhaskar, Shabina
    Thasleema, T. M.
    [J]. PROCEEDINGS OF 2019 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND KNOWLEDGE ECONOMY (ICCIKE 2019), 2019: 72 - 77
  • [5] An audio-visual speech recognition with a new Mandarin audio-visual database
    Liao, Wen-Yuan
    Pao, Tsang-Long
    Chen, Yu-Te
    Chang, Tsun-Wei
    [J]. INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS / INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007: 19+
  • [6] Integration of Deep Bottleneck Features for Audio-Visual Speech Recognition
    Ninomiya, Hiroshi
    Kitaoka, Norihide
    Tamura, Satoshi
    Iribe, Yurie
    Takeda, Kazuya
    [J]. 16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015: 563 - 567
  • [7] Audio-Visual Deep Learning for Noise Robust Speech Recognition
    Huang, Jing
    Kingsbury, Brian
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013: 7596 - 7599
  • [8] Multipose Audio-Visual Speech Recognition
    Estellers, Virginia
    Thiran, Jean-Philippe
    [J]. 19TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO-2011), 2011: 1065 - 1069
  • [9] Audio-visual integration for speech recognition
    Kober, R
    Harz, U
    [J]. NEUROLOGY PSYCHIATRY AND BRAIN RESEARCH, 1996, 4 (04) : 179 - 184