Combined Discriminative Training for Multi-Stream HMM-based Audio-Visual Speech Recognition

Authors
Huang, Jing [1 ]
Visweswariah, Karthik [2 ]
Affiliations
[1] IBM Corp, TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
[2] IBM India Res Lab, Bangalore, Karnataka, India
Keywords
discriminative training; audio-visual speech recognition; multi-stream HMM
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we investigate discriminative training of models and feature space for a multi-stream hidden Markov model (HMM) based audio-visual speech recognizer (AVSR). Since the two streams are used together in decoding, we propose to train the parameters of the two streams jointly. This is in contrast to prior work, which has considered discriminative training of the parameters in each stream independently of the other. In experiments on a 20-speaker, one-hour, speaker-independent test set, we obtain a 22% relative gain in AVSR performance over A/V models whose parameters are trained separately, and a 50% relative gain in AVSR performance over the baseline maximum-likelihood models. On a noisy test set (mismatched to training), we obtain a 21% relative gain over A/V models whose parameters are trained separately. This represents a 30% relative improvement over the maximum-likelihood baseline.
Pages: 1399 / +
Page count: 2