Robust sensor fusion: Analysis and application to audio visual speech recognition

被引:13
|
作者
Movellan, JR [1 ]
Mineiro, P [1 ]
机构
[1] Univ Calif San Diego, Dept Cognit Sci, La Jolla, CA 92093 USA
关键词
catastrophic fusion; Bayesian inference; robust statistics; audio visual speech recognition;
D O I
10.1023/A:1007468413059
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
This paper analyzes the issue of catastrophic fusion, a problem that occurs in multimodal recognition systems that integrate the output from several modules while working in non-stationary environments. For concreteness we frame the analysis with regard to the problem of automatic audio visual speech recognition (AVSR), but the issues at hand are very general and arise in multimodal recognition systems which need to work in a wide variety of contexts. Catastrophic fusion is said to have occurred when the performance of a multimodal system is inferior to the performance of some isolated modules, e.g., when the performance of the audio visual speech recognition system is inferior to that of the audio system alone. Catastrophic fusion arises because recognition modules make implicit assumptions and thus operate correctly only within a certain context. Practice shows that when modules are tested in contexts inconsistent with their assumptions, their influence on the fused product tends to increase, with catastrophic results. We propose a principled solution to this problem based upon Bayesian ideas of competitive models and inference robustification. Pie study the approach analytically on a classic Gaussian discrimination task and then apply it to a realistic problem on audio visual speech recognition (AVSR) with excellent results.
引用
收藏
页码:85 / 100
页数:16
相关论文
共 50 条
  • [31] Decision Level Fusion for Audio-Visual Speech Recognition in Noisy Conditions
    Sad, Gonzalo D.
    Terissi, Lucas D.
    Gomez, Juan C.
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2016, 2017, 10125 : 360 - 367
  • [32] Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
    Wei, Jie
    Hu, Guanyu
    Yang, Xinyu
    Luu, Anh Tuan
    Dong, Yizhuo
    INTERSPEECH 2022, 2022, : 1988 - 1992
  • [33] Performance Improvement of Audio-Visual Speech Recognition with Optimal Reliability Fusion
    Tariquzzaman, Md
    Gyu, Song Min
    Young, Kim Jin
    You, Na Seung
    Rashid, M. A.
    2010 THE 3RD INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND INDUSTRIAL APPLICATION (PACIIA2010), VOL III, 2010, : 216 - 219
  • [34] CATNet: Cross-modal fusion for audio-visual speech recognition
    Wang, Xingmei
    Mi, Jiachen
    Li, Boquan
    Zhao, Yixu
    Meng, Jiaxiang
    PATTERN RECOGNITION LETTERS, 2024, 178 : 216 - 222
  • [35] Audio-visual speech recognition in a Portuguese language based application
    Pera, V
    Sá, F
    Afonso, P
    Ferreira, R
    2003 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL TECHNOLOGY, VOLS 1 AND 2, PROCEEDINGS, 2003, : 688 - 692
  • [36] Multiobjectives Genetic Snakes:: Application on Audio-Visual Speech Recognition
    Séguier, R
    Cladel, N
    PROCEEDINGS EC-VIP-MC 2003, VOLS 1 AND 2, 2003, : 625 - 630
  • [37] Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition
    Shivappa, Shankar T.
    Rao, Bhaskar D.
    Trivedi, Mohan M.
    2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 2241 - 2244
  • [38] CONTINUOUS VISUAL SPEECH RECOGNITION FOR AUDIO SPEECH ENHANCEMENT
    Benhaim, Eric
    Sahbi, Hichem
    Vitte, Guillaume
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2244 - 2248
  • [39] Robust front-end for audio, visual and audio–visual speech classification
    Terissi L.D.
    Sad G.D.
    Gómez J.C.
    International Journal of Speech Technology, 2018, 21 (2) : 293 - 307
  • [40] Face-to-talk: Audio-visual speech detection for robust speech recognition in noisy environment
    Murai, K
    Nakamura, S
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2003, E86D (03): : 505 - 513