AUDIO-VISUAL SPEECH RECOGNITION INCORPORATING FACIAL DEPTH INFORMATION CAPTURED BY THE KINECT

Authors
Galatas, Georgios [1, 2]
Potamianos, Gerasimos [1, 3]
Makedon, Fillia [2]
Affiliations
[1] NCSR Demokritos, Inst Informat & Telecommun, Athens, Greece
[2] Univ Texas Arlington, Dept Comp Sci & Engn, Heracleia Lab, Arlington, TX 76019 USA
[3] Univ Thessaly, Dept Comp & Commun Engn, Volos, Greece
Funding
U.S. National Science Foundation
Keywords
Audio-visual automatic speech recognition; depth information; multi-sensory fusion; linear discriminant analysis; Microsoft Kinect;
DOI
Not available
Chinese Library Classification (CLC)
TM [Electrical technology]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract
We investigate the use of facial depth data of a speaking subject, captured by the Kinect device, as an additional speech-informative modality to incorporate into a traditional audio-visual automatic speech recognizer. We present our feature extraction algorithm for both the visual and the accompanying depth modalities, based on a discrete cosine transform of the mouth region-of-interest data, further transformed by a two-stage linear discriminant analysis projection to incorporate speech dynamics and improve classification. For automatic speech recognition utilizing the three available data streams (audio, visual, and depth), we consider both the feature and decision fusion paradigms, the latter via a state-synchronous tri-stream hidden Markov model. We report multi-speaker recognition results on a small-vocabulary task employing our recently collected bilingual audio-visual corpus with depth information, demonstrating improved recognition performance by the addition of the proposed depth stream, across a wide range of audio conditions.
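The decision-fusion idea mentioned in the abstract can be illustrated with a minimal sketch. In a state-synchronous multi-stream model, the score of a state at one frame is commonly formed as a weighted sum of the per-stream log-likelihoods, with one weight (stream exponent) per stream. The function name `fuse_stream_scores`, the state labels, the log-likelihood values, and the weights below are all hypothetical; the abstract does not state which weights were used, and this sketch covers only a single frame, not full HMM decoding.

```python
import math


def fuse_stream_scores(log_likes, weights):
    """Fuse per-stream log-likelihoods into one log-score per state.

    log_likes: {stream_name: {state: log_likelihood}}
    weights:   {stream_name: non-negative stream exponent}, summing to 1
    Returns    {state: sum over streams of weight * log_likelihood}
    """
    if set(log_likes) != set(weights):
        raise ValueError("streams and weights must have the same keys")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    states = list(next(iter(log_likes.values())))
    return {
        state: sum(weights[s] * log_likes[s][state] for s in weights)
        for state in states
    }


# Hypothetical log-likelihoods for two candidate states at a single frame.
log_likes = {
    "audio": {"one": -4.0, "two": -2.0},
    "visual": {"one": -1.0, "two": -3.0},
    "depth": {"one": -1.5, "two": -2.5},
}

# Assumed weights: trust audio when it is clean, lean on video and depth when noisy.
clean_audio = {"audio": 0.8, "visual": 0.1, "depth": 0.1}
noisy_audio = {"audio": 0.2, "visual": 0.4, "depth": 0.4}

fused_clean = fuse_stream_scores(log_likes, clean_audio)
fused_noisy = fuse_stream_scores(log_likes, noisy_audio)

best_clean = max(fused_clean, key=fused_clean.get)
best_noisy = max(fused_noisy, key=fused_noisy.get)
print(best_clean, best_noisy)
```

With these made-up numbers, shifting weight away from the audio stream changes which state scores highest, which is the mechanism by which extra visual and depth streams can help under degraded audio.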
Pages: 2714-2717
Number of pages: 4
Related papers
50 in total
  • [1] Multistage information fusion for audio-visual speech recognition
    Chu, SM
    Libal, V
    Marcheret, E
    Neti, C
    Potamianos, G
    [J]. 2004 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXP (ICME), VOLS 1-3, 2004, : 1651 - 1654
  • [2] Information Fusion Techniques in Audio-Visual Speech Recognition
    Karabalkan, H.
    Erdogan, H.
    [J]. 2009 IEEE 17TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, VOLS 1 AND 2, 2009, : 734 - 737
  • [3] Depth-based Features in Audio-Visual Speech Recognition
    Palecek, Karel
    Chaloupka, Josef
    [J]. 2016 39TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2016, : 303 - 306
  • [4] Audio-visual speech recognition integrating 3D lip information obtained from the Kinect
    Wang, Jianrong
    Zhang, Ju
    Honda, Kiyoshi
    Wei, Jianguo
    Dang, Jianwu
    [J]. MULTIMEDIA SYSTEMS, 2016, 22 (03) : 315 - 323
  • [5] Audio-visual speech recognition integrating 3D lip information obtained from the Kinect
    Jianrong Wang
    Ju Zhang
    Kiyoshi Honda
    Jianguo Wei
    Jianwu Dang
    [J]. Multimedia Systems, 2016, 22 : 315 - 323
  • [6] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
    Hwang, Jung-Wook
    Park, Jeongkyun
    Park, Rae-Hong
    Park, Hyung-Min
    [J]. APPLIED ACOUSTICS, 2023, 211
  • [7] An audio-visual speech recognition with a new mandarin audio-visual database
    Liao, Wen-Yuan
    Pao, Tsang-Long
    Chen, Yu-Te
    Chang, Tsun-Wei
    [J]. INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS/INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007, : 19 - +
  • [8] Audio-Visual Emotion Recognition Based on Facial Expression and Affective Speech
    Zhang, Shiqing
    Li, Lemin
    Zhao, Zhijin
    [J]. MULTIMEDIA AND SIGNAL PROCESSING, 2012, 346 : 46 - +
  • [9] Information Theoretic Feature Extraction for Audio-Visual Speech Recognition
    Gurban, Mihai
    Thiran, Jean-Philippe
    [J]. IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2009, 57 (12) : 4765 - 4776
  • [10] Deep Audio-Visual Speech Recognition
    Afouras, Triantafyllos
    Chung, Joon Son
    Senior, Andrew
    Vinyals, Oriol
    Zisserman, Andrew
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) : 8717 - 8727