Depth-based Features in Audio-Visual Speech Recognition

Authors
Palecek, Karel [1]
Chaloupka, Josef [1]
Affiliations
[1] Tech Univ Liberec, Inst Informat Technol & Elect, Liberec 46117, Czech Republic
Keywords
Audio-visual speech recognition; Depth-based features; Isolated words; Kinect; Lipreading; Multi-modal fusion; Multi-stream hidden Markov model
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We study the impact of depth-based visual features in systems for visual and audio-visual speech recognition. Rather than reconstructing depth from multiple views, we obtain the depth maps with the Kinect sensor, which is better suited to real-world applications. We extract several types of visual features from the video and depth channels and evaluate their performance both individually and in cross-channel combination. To show the complementarity of information between the video-based and the depth-based features, we examine the relative importance of each channel when the two are combined via weighted multi-stream hidden Markov models. We also introduce novel parametrizations based on the discrete cosine transform and the histogram of oriented gradients. The contribution of all presented visual speech features is demonstrated in the task of audio-visual speech recognition under noisy conditions.
Pages: 303-306
Number of pages: 4
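
Illustrative note: the abstract describes combining the video and depth channels via weighted multi-stream hidden Markov models. A minimal sketch of the usual idea follows: the emission log-likelihood of one HMM state is a weighted sum of the per-stream log-likelihoods. This is not the authors' implementation. The function names, the diagonal-covariance Gaussian emission model, the toy feature values, and the convention that the weights are non-negative and sum to 1 are all assumptions made for illustration.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at x.

    Assumed emission model for illustration only; x, mean and var are
    equal-length sequences of floats, and every variance is positive.
    """
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def multistream_loglik(stream_logliks, weights):
    """Weighted sum of per-stream emission log-likelihoods for one state.

    Assumption: weights are non-negative and sum to 1, a common
    convention for stream weights (hypothetical helper, not from the paper).
    """
    if len(stream_logliks) != len(weights):
        raise ValueError("need exactly one weight per stream")
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("stream weights must be non-negative and sum to 1")
    return sum(w * ll for w, ll in zip(weights, stream_logliks))


# Toy observation for a single frame: a 2-D video feature and a 1-D depth feature.
ll_video = gaussian_loglik([0.2, -0.1], mean=[0.0, 0.0], var=[1.0, 1.0])
ll_depth = gaussian_loglik([1.0], mean=[0.5], var=[2.0])

# A larger video weight means the video stream dominates the state score.
combined = multistream_loglik([ll_video, ll_depth], weights=[0.7, 0.3])
```

Sweeping the weight pair from (1, 0) to (0, 1) and measuring recognition accuracy at each setting is one way to examine the relative importance of each channel, in the spirit of the analysis the abstract mentions.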