RECURRENT NEURAL NETWORK TRANSDUCER FOR AUDIO-VISUAL SPEECH RECOGNITION

Cited by: 0

Authors:

Makino, Takaki ^{[1]}

Liao, Hank ^{[1]}

Assael, Yannis ^{[2]}

Shillingford, Brendan ^{[2]}

Garcia, Basilio ^{[1]}

Braga, Otavio ^{[1]}

Siohan, Olivier ^{[1]}

Affiliations:

[1] Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 USA

[2] DeepMind, 6 Pancras Sq, London N1C 4AG, England

Source:

2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019

Keywords:

Audio-visual speech recognition; recurrent neural network transducer

DOI:

10.1109/asru46091.2019.9004036

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set.

Pages: 905-912

Page count: 8
