Product HMMs for audio-visual continuous speech recognition using facial animation parameters

Authors
Aleksic, PS [1]
Katsaggelos, AK [1]
Affiliations
[1] Northwestern Univ, Dept Elect & Comp Engn, Evanston, IL 60208 USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The use of visual information in addition to acoustic information can improve automatic speech recognition (ASR). In this paper we compare different approaches to audio-visual information integration and show how they affect ASR performance. As visual features we use Facial Animation Parameters (FAPs), which the MPEG-4 standard supports for visual representation. We use both single-stream and multi-stream Hidden Markov Models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to the audio-only ASR (A-ASR) WER at various SNRs with additive white Gaussian noise.
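The phone-synchronous idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the audio and visual streams move through their own states independently within a phone, so the composite transition matrix is a Kronecker product, and it assumes a single audio stream weight in [0, 1] with the visual weight as its complement. All function names, matrices, and numbers below are hypothetical.

```python
import numpy as np

def product_transitions(a_audio, a_visual):
    """Transition matrix over composite (audio state, visual state) pairs.

    Assumption: the two streams transition independently inside a phone,
    so composite state i * n_visual + k pairs audio state i with visual state k.
    """
    return np.kron(a_audio, a_visual)

def combined_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Stream-weighted observation log-likelihood for each composite state.

    The result is flattened in the same order np.kron uses for the states.
    """
    weighted = (audio_weight * log_b_audio[:, None]
                + (1.0 - audio_weight) * log_b_visual[None, :])
    return weighted.ravel()

def relative_wer_reduction(wer_baseline, wer_system):
    """Relative WER reduction, e.g. 0.2 for a 20% relative improvement."""
    return (wer_baseline - wer_system) / wer_baseline

# Hypothetical 3-state left-to-right models for one phone.
a_audio = np.array([[0.6, 0.4, 0.0],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])
a_visual = np.array([[0.5, 0.5, 0.0],
                     [0.0, 0.8, 0.2],
                     [0.0, 0.0, 1.0]])
composite = product_transitions(a_audio, a_visual)  # 9 x 9 composite model

# Hypothetical per-state observation likelihoods for one frame.
log_b = combined_log_likelihood(np.log([0.5, 0.2, 0.3]),
                                np.log([0.1, 0.6, 0.3]),
                                audio_weight=0.7)
```

In this sketch the composite states let the two streams drift apart inside a phone while the phone boundaries stay shared; practical systems typically restrict how far the streams may drift and tie parameters across composite states, which the sketch does not attempt.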
Pages: 481-484
Number of pages: 4