Depth-based Features in Audio-Visual Speech Recognition

Authors
Palecek, Karel [1]
Chaloupka, Josef [1]
Affiliation
[1] Tech Univ Liberec, Inst Informat Technol & Elect, Liberec 46117, Czech Republic
Keywords
Audio-visual speech recognition; Depth-based features; Isolated words; Kinect; Lipreading; Multi-modal fusion; Multi-stream hidden Markov model
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We study the impact of depth-based visual features in systems for visual and audio-visual speech recognition. Instead of reconstructing depth from multiple views, we obtain the depth maps with the Kinect sensor, which is better suited to real-world applications. We extract several types of visual features from the video and depth channels and evaluate their performance both individually and in cross-channel combination. To show how the video-based and depth-based features complement each other, we examine the relative importance of each channel when the two are combined via weighted multi-stream hidden Markov models. We also introduce novel parametrizations based on the discrete cosine transform and the histogram of oriented gradients. The contribution of all presented visual speech features is demonstrated on the task of audio-visual speech recognition under noisy conditions.
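
The abstract does not state the fusion formula. In a common formulation of weighted multi-stream hidden Markov models, each stream's emission likelihood is raised to a stream weight, which becomes a weighted sum in the log domain. The following is a minimal sketch of that idea for two streams (video and depth), assuming the two weights sum to one. The function name, the weight constraint, and the numbers are illustrative assumptions, not details taken from the paper.

```python
def combine_stream_log_likelihoods(log_b_video, log_b_depth, video_weight):
    """Combine per-state emission log-likelihoods of two streams.

    Assumption (not from the paper): the depth stream weight is
    1 - video_weight, so the two weights sum to one.
    """
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    if len(log_b_video) != len(log_b_depth):
        raise ValueError("both streams must score the same set of states")
    depth_weight = 1.0 - video_weight
    # Weighted sum in the log domain equals a weighted product of likelihoods.
    return [
        video_weight * v + depth_weight * d
        for v, d in zip(log_b_video, log_b_depth)
    ]


# Hypothetical per-state log-likelihoods for one frame (made-up numbers).
video_scores = [-2.0, -5.0, -9.0]
depth_scores = [-7.0, -3.0, -8.0]

for weight in (0.0, 0.5, 1.0):
    combined = combine_stream_log_likelihoods(video_scores, depth_scores, weight)
    best_state = max(range(len(combined)), key=combined.__getitem__)
    print(weight, combined, best_state)
```

Sweeping `video_weight` from 0 to 1 in this way is one simple means of inspecting how much each channel contributes to the combined score.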
Pages: 303-306
Number of pages: 4