Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

Cited by: 0

Authors:

Jianrong Wang

Ju Zhang

Kiyoshi Honda

Jianguo Wei

Jianwu Dang

Affiliations:

[1] Tianjin University, School of Computer Science and Technology

[2] Tianjin University, School of Computer Software

Source:

Multimedia Systems | 2016 / Vol. 22

Keywords:

Audio-visual speech recognition; 3D lip information; Microsoft Kinect; Multimodal fusion

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, because the visual speech information is typically obtained from planar video data, problems in region-of-interest detection and feature extraction can degrade recognition performance. In this paper, we depart from the traditional visual speech information and propose an AVSR system that integrates 3D lip information. The Microsoft Kinect multi-sensory device was used for data collection. Different feature extraction and selection algorithms were applied to the planar images and the 3D lip information, and the resulting planar-image and 3D lip features were fused into a joint visual-3D lip feature. For automatic speech recognition (ASR), fusion methods were investigated, and the audio-visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. The experimental results demonstrated that our AVSR system integrating 3D lip information improved recognition performance over traditional ASR and AVSR systems in acoustically noisy environments.
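The abstract names a state-synchronous two-stream Hidden Markov Model but gives no formulas. As a rough illustration only, the sketch below shows the commonly used multi-stream formulation, in which each state scores the audio and visual observations separately and combines the two log-likelihoods with stream weights. The diagonal-covariance Gaussians, the `state` dictionary layout, the function names, and the convention that the two weights sum to 1 are all assumptions made for this example; none of them is taken from the paper.

```python
import numpy as np


def diag_gauss_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at x."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def two_stream_log_emission(audio_obs, visual_obs, state, audio_weight):
    """Stream-weighted log emission score for a single HMM state.

    `state` is a hypothetical dict with per-stream Gaussian parameters.
    `audio_weight` in [0, 1] is the audio stream exponent; the visual
    stream receives 1 - audio_weight (an assumed convention).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    log_audio = diag_gauss_logpdf(
        audio_obs, state["audio_mean"], state["audio_var"]
    )
    log_visual = diag_gauss_logpdf(
        visual_obs, state["visual_mean"], state["visual_var"]
    )
    # Both streams share the same state at every frame (state-synchronous),
    # so fusion reduces to a weighted sum of per-stream log-likelihoods.
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual
```

In such a setup the audio weight would typically be lowered as acoustic noise increases, so that the visual stream contributes more to the decision; how the weights were actually chosen in the paper is not stated in the abstract.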


Pages: 315-323

Page count: 8

