Frame-dependent multi-stream reliability indicators for audio-visual speech recognition

Cited by: 0

Authors:

Garg, A [1]

Potamianos, G [1]

Neti, C [1]

Huang, TS [1]

Affiliations:

[1] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA

Source:

2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING I | 2003

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) code:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

We investigate the use of local, frame-dependent reliability indicators of the audio and visual modalities as a means of estimating the stream exponents of multi-stream hidden Markov models for audio-visual automatic speech recognition. We consider two such indicators for each modality, defined as functions of the speech-class conditional observation probabilities of appropriate audio- or visual-only classifiers. We then map the four reliability indicators to the stream exponents of a state-synchronous, two-stream hidden Markov model through a sigmoid function of their linear combination. We propose two algorithms to estimate the sigmoid weights, based on the maximum conditional likelihood and minimum classification error criteria. We demonstrate the superiority of the proposed approach on a connected-digit audio-visual speech recognition task under varying audio channel noise conditions. Indeed, using the estimated frame-dependent stream exponents results in a significantly smaller word error rate than using global stream exponents. In addition, it outperforms utterance-level exponents, even though the latter utilize a priori knowledge of the utterance noise level.
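The mapping described in the abstract, a sigmoid of a linear combination of four reliability indicators, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the bias term, and the constraint that the audio and visual exponents sum to one are assumptions made for illustration; the abstract does not state them, and it does not give the exact form of the indicators, so they are taken here as plain input numbers.

```python
import math


def stream_exponent(indicators, weights, bias=0.0):
    """Map frame-level reliability indicators to an exponent in [0, 1].

    Hypothetical helper: a sigmoid applied to a linear combination of the
    indicators. The bias term is an assumption, not taken from the abstract.
    """
    if len(indicators) != len(weights):
        raise ValueError("indicators and weights must have the same length")
    z = bias + sum(w * d for w, d in zip(weights, indicators))
    # Numerically stable sigmoid: never exponentiate a large positive value.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def combined_log_likelihood(log_p_audio, log_p_visual, lam_audio):
    """Two-stream score for one frame and one state.

    Assumes the visual exponent is 1 - lam_audio, a common convention for
    two-stream HMMs that the abstract does not confirm.
    """
    return lam_audio * log_p_audio + (1.0 - lam_audio) * log_p_visual


# Illustrative numbers only: two audio and two visual indicators for a frame.
frame_indicators = [0.8, 0.6, 0.3, 0.2]
sigmoid_weights = [1.0, 1.0, -1.0, -1.0]
lam = stream_exponent(frame_indicators, sigmoid_weights)
score = combined_log_likelihood(-2.0, -4.0, lam)
```

In this sketch the exponent is recomputed for every frame, so a frame with unreliable audio can lean on the visual stream. The sigmoid weights would have to be estimated from data, for example under the maximum conditional likelihood or minimum classification error criteria named in the abstract; that estimation is not shown here.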


Pages: 24 - 27

Page count: 4

50 items in total

[1] Frame-dependent multi-stream reliability indicators for audio-visual speech recognition
Garg, A
Potamianos, G
Neti, C
Huang, TS
[J]. 2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL III, PROCEEDINGS, 2003, : 605 - 608
[2] Multi-stream asynchrony modeling for audio-visual speech recognition
Lv, Guoyun
Jiang, Dongmei
Zhao, Rongchun
Hou, Yunshu
[J]. ISM 2007: NINTH IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA, PROCEEDINGS, 2007, : 37 - 44
[3] DBN based multi-stream models for audio-visual speech recognition
Gowdy, JN
Subramanya, A
Bartels, C
Bilmes, J
[J]. 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 993 - 996
[4] Fused HMM-Adaptation of Multi-Stream HMMs for Audio-Visual Speech Recognition
Dean, David
Lucey, Patrick
Sridharan, Sridha
Wark, Tim
[J]. INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 2272 - 2275
[5] Multi-stream confidence analysis for audio-visual affect recognition
Zeng, ZH
Tu, JL
Liu, M
Huang, TS
[J]. AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, PROCEEDINGS, 2005, 3784 : 964 - 971
[6] A stream-weight optimization method for audio-visual speech recognition using multi-stream HMMs
Tamura, S
Iwano, K
Furui, S
[J]. 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 857 - 860
[7] Multi-stream articulator model with adaptive reliability measure for audio visual speech recognition
Xie, Lei
Liu, Zhi-Qiang
[J]. ADVANCES IN MACHINE LEARNING AND CYBERNETICS, 2006, 3930 : 994 - 1004
[8] Multi-stream product modal audio-visual integration strategy for robust adaptive speech recognition
Gurbuz, S
Tufekci, Z
Patterson, E
Gowdy, JN
[J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2021 - 2024
[9] Multi-Stream Asynchrony Dynamic Bayesian Network model for audio-visual continuous speech recognition
Lv, Guoyun
Jiang, Dongmei
Zhao, Rongchun
Jiang, Xiaoyue
Sahli, H.
[J]. 2007 14TH INTERNATIONAL WORKSHOP ON SYSTEMS, SIGNALS, & IMAGE PROCESSING & EURASIP CONFERENCE FOCUSED ON SPEECH & IMAGE PROCESSING, MULTIMEDIA COMMUNICATIONS & SERVICES, 2007, : 170 - +
[10] Combined Discriminative Training for Multi-Stream HMM-based Audio-Visual Speech Recognition
Huang, Jing
Visweswariah, Karthik
[J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1399 - +
