RECURRENT NEURAL NETWORK TRANSDUCER FOR AUDIO-VISUAL SPEECH RECOGNITION

Cited by: 0
Authors
Makino, Takaki [1]
Liao, Hank [1]
Assael, Yannis [2]
Shillingford, Brendan [2]
Garcia, Basilio [1]
Braga, Otavio [1]
Siohan, Olivier [1]
Affiliations
[1] Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 USA
[2] DeepMind, 6 Pancras Sq, London N1C 4AG, England
Keywords
Audio-visual speech recognition; recurrent neural network transducer
DOI
10.1109/asru46091.2019.9004036
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves on the state of the art on the LRS3-TED set.
Pages: 905-912
Number of pages: 8