Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

Cited by: 0
Authors
Jianrong Wang
Ju Zhang
Kiyoshi Honda
Jianguo Wei
Jianwu Dang
Affiliations
[1] Tianjin University, School of Computer Science and Technology
[2] Tianjin University, School of Computer Software
Source
Multimedia Systems | 2016 / Vol. 22
Keywords
Audio-visual speech recognition; 3D lip information; Microsoft Kinect; Multimodal fusion
DOI
Not available
Abstract
Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, because visual speech information is typically obtained from planar video data, problems in region-of-interest detection and feature extraction may affect recognition performance. In this paper, we depart from traditional visual speech information and propose an AVSR system that integrates 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and the 3D lip information, and the resulting planar-image and 3D lip features were fused into a visual-3D lip joint feature. For automatic speech recognition (ASR), fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. The experimental results demonstrated that our AVSR system integrating 3D lip information improved on the recognition performance of traditional ASR and AVSR systems in acoustically noisy environments.
Pages: 315-323
Page count: 8
Related papers
50 in total
  • [1] Audio-visual speech recognition integrating 3D lip information obtained from the Kinect
    Wang, Jianrong
    Zhang, Ju
    Honda, Kiyoshi
    Wei, Jianguo
    Dang, Jianwu
    [J]. MULTIMEDIA SYSTEMS, 2016, 22 (03) : 315 - 323
  • [2] Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech Recognition
    Su, Rongfeng
    Wang, Lan
    Liu, Xunying
    [J]. 2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
  • [3] AUDIO-VISUAL SPEECH RECOGNITION INCORPORATING FACIAL DEPTH INFORMATION CAPTURED BY THE KINECT
    Galatas, Georgios
    Potamianos, Gerasimos
    Makedon, Fillia
    [J]. 2012 PROCEEDINGS OF THE 20TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2012, : 2714 - 2717
  • [4] Lip movement synthesis in audio-visual speech recognition system
    Li, JQ
    Yin, YX
    [J]. PROCEEDINGS OF THE 2005 IEEE INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING (IEEE NLP-KE'05), 2005, : 461 - 465
  • [5] Analysis of lip geometric features for audio-visual speech recognition
    Kaynak, MN
    Zhi, Q
    Cheok, AD
    Sengupta, K
    Han, Z
    Chung, KC
    [J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART A-SYSTEMS AND HUMANS, 2004, 34 (04) : 564 - 570
  • [6] Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images
    Koji Iwano
    Tomoaki Yoshinaga
    Satoshi Tamura
    Sadaoki Furui
    [J]. EURASIP Journal on Audio, Speech, and Music Processing, 2007
  • [7] Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images
    Iwano, Koji
    Yoshinaga, Tomoaki
    Tamura, Satoshi
    Furui, Sadaoki
    [J]. EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2007, 2007 (1)
  • [8] Multistage information fusion for audio-visual speech recognition
    Chu, SM
    Libal, V
    Marcheret, E
    Neti, C
    Potamianos, G
    [J]. 2004 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1-3, 2004, : 1651 - 1654
  • [9] Information Fusion Techniques in Audio-Visual Speech Recognition
    Karabalkan, H.
    Erdogan, H.
    [J]. 2009 IEEE 17TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, VOLS 1 AND 2, 2009, : 734 - 737
  • [10] Lip Tracking Method for the System of Audio-Visual Polish Speech Recognition
    Kubanek, Mariusz
    Bobulski, Janusz
    Adrjanowicz, Lukasz
    [J]. ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, PT I, 2012, 7267 : 535 - 542