An asynchronous DBN for audio-visual speech recognition

Cited by: 6
Authors
Saenko, Kate [1]
Livescu, Karen [1]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
speech recognition
DOI
10.1109/SLT.2006.326841
CLC classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We investigate an asynchronous two-stream dynamic Bayesian network-based model for audio-visual speech recognition. The model allows the audio and visual streams to de-synchronize within the boundaries of each word. The probability of desynchronization by a given number of states is learned during training. This type of asynchrony has been previously used for pronunciation modeling and for visual speech recognition (lipreading); however, this is its first application to audio-visual speech recognition. We evaluate the model on an audio-visual corpus of English digits (CUAVE) with different levels of added acoustic noise, and compare it to several baselines. The asynchronous model outperforms audio-only and synchronous audio-visual baselines. We also compare models with different degrees of allowed asynchrony and find that the lowest error rate on this task is achieved when the audio and visual streams are allowed to desynchronize by up to two states.
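The core idea in the abstract — letting the two streams drift apart by at most a fixed number of states within a word — can be pictured as decoding over a constrained joint state space. The following is a minimal illustrative sketch, not the authors' implementation: the function names, the per-stream left-to-right topology, and the transition rule (each stream stays or advances one state) are assumptions for illustration; the learned desynchronization probabilities of the actual DBN are omitted.

```python
from itertools import product

def joint_states(n_states, max_async):
    """Enumerate joint (audio, visual) state pairs whose indices differ
    by at most max_async -- the bounded-asynchrony state space."""
    return [(a, v) for a, v in product(range(n_states), repeat=2)
            if abs(a - v) <= max_async]

def allowed_transition(src, dst, max_async):
    """Assumed left-to-right topology: each stream may stay put or
    advance one state, and the pair must stay within the bound."""
    (a0, v0), (a1, v1) = src, dst
    return (a1 - a0 in (0, 1)
            and v1 - v0 in (0, 1)
            and abs(a1 - v1) <= max_async)
```

With `max_async=0` this collapses to a synchronous product model (the baseline the paper compares against); with `max_async=2` it matches the setting the abstract reports as giving the lowest error rate.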
Pages: 154+
Number of pages: 2
Related papers (50 total)
  • [21] Audio-visual fuzzy fusion for robust speech recognition
    Malcangi, M.
    Ouazzane, K.
    Patel, P.
    [J]. 2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2013,
  • [22] Audio-visual speech recognition using LSTM and CNN
    El Maghraby E.E.
    Gody A.M.
    Farouk M.H.
    [J]. Recent Advances in Computer Science and Communications, 2021, 14 (06) : 2023 - 2039
  • [23] Building a data corpus for audio-visual speech recognition
    Chitu, Alin G.
    Rothkrantz, Leon J. M.
    [J]. EUROMEDIA '2007, 2007, : 88 - 92
  • [24] Speaker independent audio-visual continuous speech recognition
    Liang, LH
    Liu, XX
    Zhao, YB
    Pi, XB
    Nefian, AV
    [J]. IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I AND II, PROCEEDINGS, 2002, : A25 - A28
  • [25] Audio-Visual Speech Recognition in the Presence of a Competing Speaker
    Shao, Xu
    Barker, Jon
    [J]. INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1292 - 1295
  • [26] Audio-Visual Automatic Speech Recognition for Connected Digits
    Wang, Xiaoping
    Hao, Yufeng
    Fu, Degang
    Yuan, Chunwei
    [J]. 2008 INTERNATIONAL SYMPOSIUM ON INTELLIGENT INFORMATION TECHNOLOGY APPLICATION, VOL III, PROCEEDINGS, 2008, : 328 - +
  • [27] DARE: Deceiving Audio-Visual speech Recognition model
    Mishra, Saumya
    Gupta, Anup Kumar
    Gupta, Puneet
    [J]. KNOWLEDGE-BASED SYSTEMS, 2021, 232
  • [28] Multistage information fusion for audio-visual speech recognition
    Chu, SM
    Libal, V
    Marcheret, E
    Neti, C
    Potamianos, G
    [J]. 2004 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXP (ICME), VOLS 1-3, 2004, : 1651 - 1654
  • [29] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
  • [30] Relevant feature selection for audio-visual speech recognition
    Drugman, Thomas
    Gurban, Mihai
    Thiran, Jean-Philippe
    [J]. 2007 IEEE NINTH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2007, : 179 - +