Bimodal fusion in audio-visual speech recognition

Times cited: 0
Authors
Zhang, XZ [1]
Mersereau, RM [1]
Clements, M [1]
Affiliations
[1] Georgia Inst Technol, Ctr Signal & Image Proc, Atlanta, GA 30332 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Extending automatic speech recognition (ASR) to the visual modality has been shown to greatly increase recognition accuracy and improve system robustness over purely acoustic systems, especially in acoustically hostile environments. An important aspect of designing such systems is how to incorporate the visual component into the acoustic speech recognizer to achieve optimal performance. In this paper, we investigate methods of integrating the audio and visual modalities within HMM-based classification models. We examine existing integration schemes and propose the use of a coupled hidden Markov model (CHMM) to exploit audio-visual interaction. Our experimental results demonstrate that the CHMM consistently outperforms other integration models for a large range of acoustic noise levels and suggest that it better captures temporal correlations between the two streams of information.
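The abstract does not spell out the CHMM parameterization, so the following is only a minimal sketch of one common formulation: two hidden chains (audio and visual) in which each chain's next state is conditioned on the previous states of both chains, evaluated with a forward pass over the joint state. Discrete observation symbols, the function name `chmm_forward`, and all toy parameter values are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import product


def chmm_forward(init_a, init_v, trans_a, trans_v, emit_a, emit_v, obs_a, obs_v):
    """Likelihood of paired observation sequences under a two-chain coupled HMM.

    trans_a[i][j][k] = P(a_t = k | a_{t-1} = i, v_{t-1} = j)
    trans_v[i][j][l] = P(v_t = l | a_{t-1} = i, v_{t-1} = j)
    emit_a[k][o]     = P(audio symbol o | a_t = k); emit_v is analogous.
    Discrete symbols are an illustrative assumption.
    """
    n_a, n_v = len(init_a), len(init_v)
    states = list(product(range(n_a), range(n_v)))

    # Initialisation: the chains start independently and each emits its own symbol.
    alpha = {
        (k, l): init_a[k] * init_v[l] * emit_a[k][obs_a[0]] * emit_v[l][obs_v[0]]
        for k, l in states
    }

    # Recursion over the remaining time steps.
    for oa, ov in zip(obs_a[1:], obs_v[1:]):
        new_alpha = {}
        for k, l in states:
            # Coupling: both transitions depend on the previous (audio, visual) pair.
            total = sum(
                alpha[i, j] * trans_a[i][j][k] * trans_v[i][j][l] for i, j in states
            )
            new_alpha[k, l] = total * emit_a[k][oa] * emit_v[l][ov]
        alpha = new_alpha

    # Termination: marginalise over the final joint state.
    return sum(alpha.values())


# Hypothetical toy parameters: two states and two symbols per stream.
init_a = [0.6, 0.4]
init_v = [0.5, 0.5]
trans_a = [[[0.7, 0.3], [0.4, 0.6]], [[0.2, 0.8], [0.1, 0.9]]]
trans_v = [[[0.8, 0.2], [0.3, 0.7]], [[0.6, 0.4], [0.1, 0.9]]]
emit_a = [[0.9, 0.1], [0.2, 0.8]]
emit_v = [[0.7, 0.3], [0.4, 0.6]]

likelihood = chmm_forward(
    init_a, init_v, trans_a, trans_v, emit_a, emit_v, [0, 1, 1], [0, 0, 1]
)
print(likelihood)
```

If `trans_a` ignored the visual index and `trans_v` ignored the audio index, this computation would factor into two independent HMM likelihoods; the cross-chain dependence is what allows such a model to represent temporal correlation between the two streams.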
Pages: 964-967
Page count: 4
Related papers
50 in total
  • [21] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
  • [22] Audio-Visual Speech Recognition in Noisy Audio Environments
    Palecek, Karel
    Chaloupka, Josef
    [J]. 2013 36TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2013, : 484 - 487
  • [23] Audio-Visual Speech Modeling for Continuous Speech Recognition
    Dupont, Stephane
    Luettin, Juergen
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2 (03) : 141 - 151
  • [24] A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection
    Tamura, Satoshi
    Ishikawa, Masato
    Hashiba, Takashi
    Takeuchi, Shin'ichi
    Hayamizu, Satoru
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2702 - +
  • [25] A coupled HMM for audio-visual speech recognition
    Nefian, AV
    Liang, LH
    Pi, XB
    Xiaoxiang, L
    Mao, C
    Murphy, K
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2013 - 2016
  • [26] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    [J]. 2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076
  • [27] An asynchronous DBN for audio-visual speech recognition
    Saenko, Kate
    Livescu, Karen
    [J]. 2006 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 2006, : 154 - +
  • [28] Fuzzy-Neural-Network Based Audio-Visual Fusion for Speech Recognition
    Wu, Gin-Der
    Tsai, Hao-Shu
    [J]. 2019 1ST INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE IN INFORMATION AND COMMUNICATION (ICAIIC 2019), 2019, : 210 - 214
  • [29] Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition
    Sterpu, George
    Saam, Christian
    Harte, Naomi
    [J]. ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 111 - 115
  • [30] Scene recognition with audio-visual sensor fusion
    Devicharan, D
    Mehrotra, KG
    Mohan, CK
    Varshney, PK
    Zuo, L
    [J]. Multisensor, Multisource Information Fusion: Architectures, Algorithms and Applications 2005, 2005, 5813 : 201 - 210