Rapid feature space speaker adaptation for multi-stream HMM-based audio-visual speech recognition

Cited: 0
Authors
Huang, J [1]
Marcheret, E [1]
Visweswariah, K [1]
Institution
[1] IBM Corp, TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multi-stream hidden Markov models (HMMs) have recently been very successful in audio-visual speech recognition, where the audio and visual streams are fused at the final decision level. In this paper we investigate fast feature-space speaker adaptation using multi-stream HMMs for audio-visual speech recognition. In particular, we focus on studying the performance of feature-space maximum likelihood linear regression (fMLLR), a fast and effective method for estimating feature-space transforms. Unlike the common speaker adaptation techniques of MAP or MLLR, fMLLR does not change the audio or visual HMM parameters, but simply applies a single transform to the testing features. We also address the problem of fast and robust on-line fMLLR adaptation using feature-space maximum a posteriori linear regression (fMAPLR). Adaptation experiments are reported on the IBM infrared headset audio-visual database. On average, for a 20-speaker, 1-hour independent test set, the multi-stream fMLLR achieves a 31% relative gain in the clean audio condition and a 59% relative gain in the noisy audio condition (approximately 7 dB SNR) compared to the baseline multi-stream system.
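As the abstract notes, the key property of fMLLR is that it leaves the HMM parameters untouched and instead applies one affine transform to the test features themselves. A minimal sketch of that application step, assuming a transform `A`, bias `b`, and helper `apply_fmllr` that are illustrative stand-ins (the maximum-likelihood estimation of the transform from adaptation data is omitted here):

```python
import numpy as np

def apply_fmllr(features, A, b):
    """Apply an fMLLR-style affine transform frame-by-frame: x -> A @ x + b.

    features: (num_frames, dim) array of acoustic (or visual) feature vectors.
    A: (dim, dim) transform matrix, b: (dim,) bias vector.
    The HMM itself is never modified; only the incoming features change.
    """
    return features @ A.T + b

# Toy example: 5 frames of 3-dimensional features with a hypothetical
# scaling-plus-shift transform (not an actual estimated fMLLR transform).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
A = np.eye(3) * 1.1
b = np.array([0.5, 0.0, -0.5])

X_adapted = apply_fmllr(X, A, b)
assert X_adapted.shape == X.shape  # same shape, speaker-normalized features
```

Because decoding then proceeds with the unchanged speaker-independent models on the transformed features, only a single `(dim, dim)` matrix and `(dim,)` bias per speaker need to be estimated, which is what makes the method fast compared to MAP or MLLR model-space updates.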
Pages: 338 - 341
Page count: 4
Related Papers
(50 in total)
  • [1] Combined Discriminative Training for Multi-Stream HMM-based Audio-Visual Speech Recognition
    Huang, Jing
    Visweswariah, Karthik
    [J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1399 - +
  • [2] Improved Decision Trees for Multi-stream HMM-based Audio-Visual Continuous Speech Recognition
    Huang, Jing
    Visweswariah, Karthik
    [J]. 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), 2009, : 228 - +
  • [3] Fused HMM-Adaptation of Multi-Stream HMMs for Audio-Visual Speech Recognition
    Dean, David
    Lucey, Patrick
    Sridharan, Sridha
    Wark, Tim
    [J]. INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 2272 - 2275
  • [4] DBN based multi-stream models for audio-visual speech recognition
    Gowdy, JN
    Subramanya, A
    Bartels, C
    Bilmes, J
    [J]. 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 993 - 996
  • [5] Multi-stream asynchrony modeling for audio-visual speech recognition
    Lv, Guoyun
    Jiang, Dongmei
    Zhao, Rongchun
    Hou, Yunshu
    [J]. ISM 2007: NINTH IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA, PROCEEDINGS, 2007, : 37 - 44
  • [6] Audio-visual affect recognition through multi-stream fused HMM for HCI
    Zeng, ZH
    Tu, JL
    Pianfetti, B
    Liu, M
    Zhang, T
    Zhang, ZQ
    Huang, TS
    Levinson, S
    [J]. 2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 2, PROCEEDINGS, 2005, : 967 - 972
  • [7] Frame-dependent multi-stream reliability indicators for audio-visual speech recognition
    Garg, A
    Potamianos, G
    Neti, C
    Huang, TS
    [J]. 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING I, 2003, : 24 - 27
  • [8] Frame-dependent multi-stream reliability indicators for audio-visual speech recognition
    Garg, A
    Potamianos, G
    Neti, C
    Huang, TS
    [J]. 2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL III, PROCEEDINGS, 2003, : 605 - 608
  • [9] Multi-stream confidence analysis for audio-visual affect recognition
    Zeng, ZH
    Tu, JL
    Liu, M
    Huang, TS
    [J]. AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, PROCEEDINGS, 2005, 3784 : 964 - 971
  • [10] Discriminative training of HMM stream exponents for audio-visual speech recognition
    Potamianos, G
    Graf, HP
    [J]. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 3733 - 3736