An asynchronous DBN for audio-visual speech recognition

Cited by: 6
Authors
Saenko, Kate [1]
Livescu, Karen [1]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
speech recognition;
DOI
10.1109/SLT.2006.326841
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We investigate an asynchronous two-stream dynamic Bayesian network-based model for audio-visual speech recognition. The model allows the audio and visual streams to de-synchronize within the boundaries of each word. The probability of desynchronization by a given number of states is learned during training. This type of asynchrony has been previously used for pronunciation modeling and for visual speech recognition (lipreading); however, this is its first application to audiovisual speech recognition. We evaluate the model on an audiovisual corpus of English digits (CUAVE) with different levels of added acoustic noise, and compare it to several baselines. The asynchronous model outperforms audio-only and synchronous audio-visual baselines. We also compare models with different degrees of allowed asynchrony and find that the lowest error rate on this task is achieved when the audio and visual streams are allowed to desynchronize by up to two states.
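The abstract's key constraint — that the audio and visual streams may desynchronize by at most a fixed number of states within a word — can be illustrated by enumerating the coupled state space it induces. The sketch below is ours, not the authors' implementation; the function name and parameters are illustrative assumptions:

```python
from itertools import product

def joint_states(n_audio: int, n_video: int, max_async: int):
    """Enumerate the coupled (audio, video) state pairs permitted when
    the two streams may desynchronize by at most `max_async` states.

    Illustrative only: the paper's DBN additionally learns a probability
    for each degree of desynchronization, which is omitted here.
    """
    return [(a, v) for a, v in product(range(n_audio), range(n_video))
            if abs(a - v) <= max_async]

# With 4 states per stream and up to 2 states of allowed asynchrony
# (the setting the paper found best on CUAVE), only pairs whose state
# indices differ by at most 2 survive.
states = joint_states(4, 4, 2)
print(len(states))  # 14 of the 16 unconstrained pairs remain
```

Setting `max_async=0` recovers the synchronous audio-visual baseline, where both streams must occupy the same state index.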
Pages: 154 / +
Number of pages: 2
Related Papers
(50 items total)
  • [41] An audio-visual corpus for multimodal automatic speech recognition
    Czyzewski, Andrzej
    Kostek, Bozena
    Bratoszewski, Piotr
    Kotus, Jozef
    Szykulski, Marcin
    [J]. JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2017, 49 (02) : 167 - 192
  • [42] Audio-Visual Multilevel Fusion for Speech and Speaker Recognition
    Chetty, Girija
    Wagner, Michael
    [J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 379 - 382
  • [43] Turbo Decoders for Audio-visual Continuous Speech Recognition
    Abdelaziz, Ahmed Hussen
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3667 - 3671
  • [44] Lips Detection for Audio-Visual Speech Recognition System
    Chin, Siew Wen
    Ang, Li-Minn
    Seng, Kah Phooi
    [J]. 2008 INTERNATIONAL SYMPOSIUM ON INTELLIGENT SIGNAL PROCESSING AND COMMUNICATIONS SYSTEMS (ISPACS 2008), 2008, : 311 - 314
  • [45] Dynamic Bayesian networks for audio-visual speech recognition
    Nefian, AV
    Liang, LH
    Pi, XB
    Liu, XX
    Murphy, K
    [J]. EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING, 2002, 2002 (11) : 1274 - 1288
  • [46] Information Fusion Techniques in Audio-Visual Speech Recognition
    Karabalkan, H.
    Erdogan, H.
    [J]. 2009 IEEE 17TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, VOLS 1 AND 2, 2009, : 734 - 737
  • [47] Large Vocabulary Continuous Audio-Visual Speech Recognition
    Sterpu, George
    [J]. ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 538 - 541
  • [48] Audio-visual speech recognition using deep learning
    Kuniaki Noda
    Yuki Yamaguchi
    Kazuhiro Nakadai
    Hiroshi G. Okuno
    Tetsuya Ogata
    [J]. Applied Intelligence, 2015, 42 : 722 - 737
  • [49] AUDIO-VISUAL ISOLATED DIGIT RECOGNITION FOR WHISPERED SPEECH
    Fan, Xing
    Busso, Carlos
    Hansen, John H. L.
    [J]. 19TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO-2011), 2011, : 1500 - 1503
  • [50] A COMPACT FORMULATION OF TURBO AUDIO-VISUAL SPEECH RECOGNITION
    Receveur, Simon
    Meyer, Patrick
    Fingscheidt, Tim
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,