Multi-stream product modal audio-visual integration strategy for robust adaptive speech recognition

Times cited: 0
Authors
Gurbuz, S [1]
Tufekci, Z [1]
Patterson, E [1]
Gowdy, JN [1]
Affiliation
[1] Clemson Univ, Dept Elect & Comp Engn, Clemson, SC 29634 USA
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In this paper, we extend an existing audio-only automatic speech recognizer into a multi-stream audio-visual automatic speech recognition (AV-ASR) system. Our method forms a multi-stream feature vector from audio-visual speech data, computes the statistical modal parameter probabilities from the multi-stream audio-visual features, and performs dynamic programming jointly over multi-stream product modal Hidden Markov Models (MSPM-HMMs), using a stream-weighting value chosen according to the noise type and signal-to-noise ratio (SNR). Experimental results are presented for an isolated-word recognition task with eight noise types from the NOISEX database at several SNR values. Averaged over these SNR values and noise types, the proposed system achieves a word error rate (WER) of 2.6% on the validation set, compared with 55.9% for the audio-only recognizer and 7.9% for the late-integration audio-visual recognizer.
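The abstract combines three ingredients: per-stream likelihoods, a stream weight selected by noise type and SNR, and joint dynamic programming over a product of audio and visual HMM states. The following is a minimal sketch of how these pieces commonly fit together, not the authors' implementation. It assumes the usual multi-stream score λ·log b_A(o_A) + (1 − λ)·log b_V(o_V), one diagonal Gaussian per state, equal stay/advance transition probabilities, and a product state space whose audio and visual state indices may drift apart by at most `max_lag`. The weight table `STREAM_WEIGHTS`, all function names, and all numbers are hypothetical placeholders; the paper's actual features, topology, and weight values are not reproduced here.

```python
import math
from itertools import product

NEG_INF = float("-inf")

# Hypothetical stream-weight table keyed by (noise type, SNR in dB).
# The values are illustrative placeholders, NOT taken from the paper.
STREAM_WEIGHTS = {("white", 0): 0.3, ("white", 10): 0.6, ("white", 20): 0.9}


def audio_weight(noise_type, snr_db, default=0.5):
    """Look up the audio stream weight (lambda); the visual weight is 1 - lambda."""
    return STREAM_WEIGHTS.get((noise_type, snr_db), default)


def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def product_viterbi(audio_obs, video_obs, audio_states, video_states, lam, max_lag=1):
    """Best log score of a left-to-right product HMM over (audio, visual) state pairs.

    audio_states / video_states: lists of (mean, var) pairs, one per state.
    Assumption: each stream independently stays or advances with probability 0.5.
    """
    n_a, n_v = len(audio_states), len(video_states)
    # Bounded asynchrony: keep only pairs whose indices differ by <= max_lag.
    states = [(i, j) for i, j in product(range(n_a), range(n_v)) if abs(i - j) <= max_lag]
    log_trans = 2 * math.log(0.5)  # product of the two per-stream transitions

    def emission(state, t):
        i, j = state
        la = gaussian_logpdf(audio_obs[t], *audio_states[i])
        lv = gaussian_logpdf(video_obs[t], *video_states[j])
        return lam * la + (1.0 - lam) * lv  # stream-weighted log-likelihood

    score = {s: NEG_INF for s in states}
    score[(0, 0)] = emission((0, 0), 0)
    for t in range(1, len(audio_obs)):
        new = {}
        for i, j in states:
            best = NEG_INF
            for di, dj in product((0, 1), repeat=2):  # stay or advance in each stream
                prev = (i - di, j - dj)
                if prev in score and score[prev] > NEG_INF:
                    best = max(best, score[prev] + log_trans)
            new[(i, j)] = best + emission((i, j), t) if best > NEG_INF else NEG_INF
        score = new
    return score.get((n_a - 1, n_v - 1), NEG_INF)


def recognize(audio_obs, video_obs, word_models, noise_type, snr_db):
    """Isolated-word decision: the word whose product HMM scores highest."""
    lam = audio_weight(noise_type, snr_db)
    scores = {w: product_viterbi(audio_obs, video_obs, a, v, lam)
              for w, (a, v) in word_models.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    unit = [1.0]
    models = {  # toy 1-D, two-state word models (hypothetical)
        "yes": ([([0.0], unit), ([2.0], unit)], [([0.0], unit), ([1.0], unit)]),
        "no": ([([5.0], unit), ([7.0], unit)], [([5.0], unit), ([6.0], unit)]),
    }
    audio = [[0.1], [1.9], [2.1]]
    video = [[0.0], [0.9], [1.1]]
    print(recognize(audio, video, models, "white", 10))
```

Lowering λ at low SNR shifts trust toward the visual stream, which is the intuition behind a noise- and SNR-dependent weight; the bounded-lag product state space is one common way to let the lips lead or trail the audio by a state without allowing arbitrary asynchrony.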
Pages: 2021-2024
Number of pages: 4