Frame-dependent multi-stream reliability indicators for audio-visual speech recognition

被引:0
|
作者
Garg, A [1 ]
Potamianos, G [1 ]
Neti, C [1 ]
Huang, TS [1 ]
机构
[1] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
关键词
D O I
暂无
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
We investigate the use of local, frame-dependent reliability indicators of the audio and visual modalities, as a means of estimating stream exponents of multi-stream hidden Markov models for audio-visual automatic speech recognition. We consider two such indicators at each modality, defined as functions of the speech-class conditional observation probabilities of appropriate audio-or visual-only classifiers. We subsequently map the four reliability indicators into the stream exponents of a state-synchronous, two-stream hidden Markov model, as a sigmoid function of their linear combination. We propose two algorithms to estimate the sigmoid weights, based on the maximum conditional likelihood and minimum classification error criteria. We demonstrate the superiority of the proposed approach on a connected-digit audio-visual speech recognition task, under varying audio channel noise conditions. Indeed, the use of the estimated frame-dependent stream exponents results in a significantly smaller word error rate than using global stream exponents. In addition, it outperforms utterance-level exponents, even though the latter utilize a-priori knowledge of the utterance noise level.
引用
收藏
页码:24 / 27
页数:4
相关论文
共 50 条
  • [1] Frame-dependent multi-stream reliability indicators for audio-visual speech recognition
    Garg, A
    Potamianos, G
    Neti, C
    Huang, TS
    [J]. 2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL III, PROCEEDINGS, 2003, : 605 - 608
  • [2] Multi-stream asynchrony modeling for audio-visual speech recognition
    Lv, Guoyun
    Jiang, Dongmei
    Zhao, Rongchun
    Hou, Yunshu
    [J]. ISM 2007: NINTH IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA, PROCEEDINGS, 2007, : 37 - 44
  • [3] DBN based multi-stream models for audio-visual speech recognition
    Gowdy, JN
    Subramanya, A
    Bartels, C
    Bilmes, J
    [J]. 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 993 - 996
  • [4] Fused HMM-Adaptation of Multi-Stream HMMs for Audio-Visual Speech Recognition
    Dean, David
    Lucey, Patrick
    Sridharan, Sridha
    Wark, Tim
    [J]. INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 2272 - 2275
  • [5] Multi-stream confidence analysis for audio-visual affect recognition
    Zeng, ZH
    Tu, JL
    Liu, M
    Huang, TS
    [J]. AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, PROCEEDINGS, 2005, 3784 : 964 - 971
  • [6] A stream-weight optimization method for audio-visual speech recognition using multi-stream HMMS
    Tamura, S
    Iwano, K
    Furui, S
    [J]. 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 857 - 860
  • [7] Multi-stream articulator model with adaptive reliability measure for audio visual speech recognition
    Xie, Lei
    Liu, Zhi-Qiang
    [J]. ADVANCES IN MACHINE LEARNING AND CYBERNETICS, 2006, 3930 : 994 - 1004
  • [8] Multi-stream product modal audio-visual integration strategy for robust adaptive speech recognition
    Gurbuz, S
    Tufekci, Z
    Patterson, E
    Gowdy, JN
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2021 - 2024
  • [9] Combined Discriminative Training for Multi-Stream HMM-based Audio-Visual Speech Recognition
    Huang, Jing
    Visweswariah, Karthik
    [J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1399 - +
  • [10] Multi-Stream Asynchrony Dynamic Bayesian Network model for audio-visual continuous speech recognition
    Lv, Guoyun
    Jiang, Dongmei
    Zhao, Rongchun
    Jiang, Xiaoyue
    Sahli, H.
    [J]. 2007 14TH INTERNATIONAL WORKSHOP ON SYSTEMS, SIGNALS, & IMAGE PROCESSING & EURASIP CONFERENCE FOCUSED ON SPEECH & IMAGE PROCESSING, MULTIMEDIA COMMUNICATIONS & SERVICES, 2007, : 170 - +