RECURRENT NEURAL NETWORK TRANSDUCER FOR AUDIO-VISUAL SPEECH RECOGNITION

Cited by: 0

Authors:

Makino, Takaki [1]

Liao, Hank [1]

Assael, Yannis [2]

Shillingford, Brendan [2]

Garcia, Basilio [1]

Braga, Otavio [1]

Siohan, Olivier [1]

Affiliations:

[1] Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 USA

[2] DeepMind, 6 Pancras Sq, London N1C 4AG, England

Source:

2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019

Keywords:

Audio-visual speech recognition; recurrent neural network transducer

DOI:

10.1109/asru46091.2019.9004036

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set.

Pages: 905-912

Page count: 8
