Robust sensor fusion: Analysis and application to audio visual speech recognition

Cited by: 13
Authors
Movellan, JR [1]
Mineiro, P [1]
Affiliations
[1] Univ Calif San Diego, Dept Cognit Sci, La Jolla, CA 92093 USA
Keywords
catastrophic fusion; Bayesian inference; robust statistics; audio visual speech recognition
DOI
10.1023/A:1007468413059
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper analyzes the issue of catastrophic fusion, a problem that occurs in multimodal recognition systems that integrate the output from several modules while working in non-stationary environments. For concreteness we frame the analysis with regard to the problem of automatic audio visual speech recognition (AVSR), but the issues at hand are very general and arise in multimodal recognition systems which need to work in a wide variety of contexts. Catastrophic fusion is said to have occurred when the performance of a multimodal system is inferior to the performance of some isolated modules, e.g., when the performance of the audio visual speech recognition system is inferior to that of the audio system alone. Catastrophic fusion arises because recognition modules make implicit assumptions and thus operate correctly only within a certain context. Practice shows that when modules are tested in contexts inconsistent with their assumptions, their influence on the fused product tends to increase, with catastrophic results. We propose a principled solution to this problem based upon Bayesian ideas of competitive models and inference robustification. We study the approach analytically on a classic Gaussian discrimination task and then apply it to a realistic AVSR problem with excellent results.
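To make the failure mode concrete, the following is a minimal, hypothetical sketch (not the paper's algorithm) of how catastrophic fusion can arise in a two-class Gaussian discrimination task when the log-likelihoods of an audio and a video module are simply summed, and how down-weighting an unreliable stream can avoid it. All class means, observations, and the fixed weight values below are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's method): catastrophic fusion in a
# two-class Gaussian discrimination task, and a reliability-weighted remedy.
import numpy as np

def gaussian_loglik(x, mean, sigma):
    """Log-density of a scalar observation under a Gaussian class model."""
    return -0.5 * ((x - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def module_logliks(x, class_means, sigma):
    """Per-class log-likelihoods produced by one recognition module."""
    return np.array([gaussian_loglik(x, m, sigma) for m in class_means])

def fuse(ll_audio, ll_video, w_audio=1.0, w_video=1.0):
    """Log-likelihood fusion; the weights act as stream-reliability exponents."""
    return w_audio * ll_audio + w_video * ll_video

class_means = [0.0, 2.0]   # hypothetical class models shared by both streams
x_audio = 1.8              # audio observation, consistent with class 1
x_video = -1.0             # video observation, corrupted by out-of-context noise

ll_a = module_logliks(x_audio, class_means, sigma=1.0)
ll_v = module_logliks(x_video, class_means, sigma=1.0)

naive = fuse(ll_a, ll_v)                             # equal trust in both streams
robust = fuse(ll_a, ll_v, w_audio=1.0, w_video=0.2)  # down-weight the corrupted stream

print("audio-only decision:", int(np.argmax(ll_a)))    # 1
print("naive fusion:       ", int(np.argmax(naive)))   # 0 (catastrophic fusion)
print("robust fusion:      ", int(np.argmax(robust)))  # 1
```

In this toy setup the corrupted video stream drags the naively fused decision away from the audio-only answer; a small fixed weight on the video stream restores it. The paper itself derives the weighting from Bayesian competitive models and inference robustification rather than from a hand-set constant like the hypothetical w_video=0.2 used here.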
Pages: 85-100
Number of pages: 16