Robust sensor fusion: Analysis and application to audio visual speech recognition

Cited by: 13

Authors:

Movellan, JR [1]

Mineiro, P [1]

Affiliations:

[1] Univ Calif San Diego, Dept Cognit Sci, La Jolla, CA 92093 USA

Source:

MACHINE LEARNING | 1998 / Vol. 32 / Issue 02

Keywords:

catastrophic fusion; Bayesian inference; robust statistics; audio visual speech recognition;

DOI:

10.1023/A:1007468413059

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper analyzes the issue of catastrophic fusion, a problem that occurs in multimodal recognition systems that integrate the output from several modules while working in non-stationary environments. For concreteness we frame the analysis with regard to the problem of automatic audio visual speech recognition (AVSR), but the issues at hand are very general and arise in multimodal recognition systems which need to work in a wide variety of contexts. Catastrophic fusion is said to have occurred when the performance of a multimodal system is inferior to the performance of some isolated modules, e.g., when the performance of the audio visual speech recognition system is inferior to that of the audio system alone. Catastrophic fusion arises because recognition modules make implicit assumptions and thus operate correctly only within a certain context. Practice shows that when modules are tested in contexts inconsistent with their assumptions, their influence on the fused product tends to increase, with catastrophic results. We propose a principled solution to this problem based upon Bayesian ideas of competitive models and inference robustification. We study the approach analytically on a classic Gaussian discrimination task and then apply it to a realistic problem on audio visual speech recognition (AVSR) with excellent results.
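
The effect described in the abstract can be illustrated with a small numerical sketch. It is not the paper's formulation: it assumes two independent modules (labelled audio and video here), each with unit-variance Gaussian class likelihoods centred at +1 and -1, and a hypothetical broad "out of context" mixture component (weight `eps`, width `broad_sigma`) as one common way to robustify a likelihood. All function names and parameter values are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def plain_llr(x, mu=1.0, sigma=1.0):
    """log p(x|+) - log p(x|-) for pure Gaussians with means +mu and -mu.

    The closed form is linear in x, so an input far outside the assumed
    context produces an arbitrarily large vote.
    """
    return 2.0 * mu * x / sigma ** 2


def robust_llr(x, mu=1.0, sigma=1.0, eps=0.05, broad_sigma=10.0):
    """Same ratio, but each class likelihood is mixed with a broad component.

    The broad component is shared by both classes (an assumption of this
    sketch), so for extreme inputs it dominates and the ratio tends to 0.
    """
    background = eps * gaussian_pdf(x, 0.0, broad_sigma)
    pos = (1.0 - eps) * gaussian_pdf(x, mu, sigma) + background
    neg = (1.0 - eps) * gaussian_pdf(x, -mu, sigma) + background
    return math.log(pos) - math.log(neg)


def fuse(llrs):
    """Fuse conditionally independent modules by summing their log-ratios."""
    return sum(llrs)


# Hypothetical inputs: the true class is "+". Audio gives mild, correct
# evidence; video is badly out of context (for example, a corrupted frame).
audio_x = 0.5
video_x = -10.0

plain_audio_only = plain_llr(audio_x)
plain_fused = fuse([plain_llr(audio_x), plain_llr(video_x)])
robust_fused = fuse([robust_llr(audio_x), robust_llr(video_x)])

print("audio alone (plain):", plain_audio_only)
print("fused (plain):      ", plain_fused)
print("fused (robust):     ", robust_fused)
```

With these inputs the plain fused score is negative although the audio module alone votes for the correct class, which is the catastrophic fusion pattern; with the broad component added, the out-of-context video input contributes almost nothing and the fused decision follows the audio module.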

Pages: 85-100

Page count: 16

50 records in total

[31] Decision Level Fusion for Audio-Visual Speech Recognition in Noisy Conditions
Sad, Gonzalo D.
Terissi, Lucas D.
Gomez, Juan C.
PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2016, 2017, 10125 : 360 - 367
[32] Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
Wei, Jie
Hu, Guanyu
Yang, Xinyu
Luu, Anh Tuan
Dong, Yizhuo
INTERSPEECH 2022, 2022 : 1988 - 1992
[33] Performance Improvement of Audio-Visual Speech Recognition with Optimal Reliability Fusion
Tariquzzaman, Md
Gyu, Song Min
Young, Kim Jin
You, Na Seung
Rashid, M. A.
2010 THE 3RD INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND INDUSTRIAL APPLICATION (PACIIA2010), VOL III, 2010 : 216 - 219
[34] CATNet: Cross-modal fusion for audio-visual speech recognition
Wang, Xingmei
Mi, Jiachen
Li, Boquan
Zhao, Yixu
Meng, Jiaxiang
PATTERN RECOGNITION LETTERS, 2024, 178 : 216 - 222
[35] Audio-visual speech recognition in a Portuguese language based application
Pera, V
Sá, F
Afonso, P
Ferreira, R
2003 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL TECHNOLOGY, VOLS 1 AND 2, PROCEEDINGS, 2003 : 688 - 692
[36] Multiobjectives Genetic Snakes: Application on Audio-Visual Speech Recognition
Séguier, R
Cladel, N
PROCEEDINGS EC-VIP-MC 2003, VOLS 1 AND 2, 2003 : 625 - 630
[37] Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition
Shivappa, Shankar T.
Rao, Bhaskar D.
Trivedi, Mohan M.
2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008 : 2241 - 2244
[38] CONTINUOUS VISUAL SPEECH RECOGNITION FOR AUDIO SPEECH ENHANCEMENT
Benhaim, Eric
Sahbi, Hichem
Vitte, Guillaume
2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015 : 2244 - 2248
[39] Robust front-end for audio, visual and audio–visual speech classification
Terissi L.D.
Sad G.D.
Gómez J.C.
International Journal of Speech Technology, 2018, 21 (2) : 293 - 307
[40] Face-to-talk: Audio-visual speech detection for robust speech recognition in noisy environment
Murai, K
Nakamura, S
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2003, E86D (03) : 505 - 513