AUDIO-VISUAL SPEECH RECOGNITION INCORPORATING FACIAL DEPTH INFORMATION CAPTURED BY THE KINECT

Cited by: 0

Authors:

Galatas, Georgios [1, 2]

Potamianos, Gerasimos [1, 3]

Makedon, Fillia [2]

Affiliations:

[1] NCSR Demokritos, Inst Informat & Telecommun, Athens, Greece

[2] Univ Texas Arlington, Dept Comp Sci & Engn, Heracleia Lab, Arlington, TX 76019 USA

[3] Univ Thessaly, Dept Comp & Commun Engn, Volos, Greece

Source:

2012 PROCEEDINGS OF THE 20TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO) | 2012

Funding:

National Science Foundation (USA)

Keywords:

Audio-visual automatic speech recognition; depth information; multi-sensory fusion; linear discriminant analysis; Microsoft Kinect

DOI:

Not available

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology]

Discipline classification code:

0808; 0809

Abstract:

We investigate the use of facial depth data of a speaking subject, captured by the Kinect device, as an additional speech-informative modality to incorporate into a traditional audio-visual automatic speech recognizer. We present our feature extraction algorithm for both the visual and the accompanying depth modalities, based on a discrete cosine transform of the mouth region-of-interest data, further transformed by a two-stage linear discriminant analysis projection to incorporate speech dynamics and improve classification. For automatic speech recognition utilizing the three available data streams (audio, visual, and depth), we consider both the feature and decision fusion paradigms, the latter via a state-synchronous tri-stream hidden Markov model. We report multi-speaker recognition results on a small-vocabulary task employing our recently collected bilingual audio-visual corpus with depth information, demonstrating improved recognition performance by the addition of the proposed depth stream, across a wide range of audio conditions.
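The abstract does not spell out how the tri-stream model combines its streams. A common formulation for state-synchronous multi-stream HMMs raises each stream's emission likelihood to a stream exponent, which in the log domain is a weighted sum of per-stream log-likelihoods. The sketch below illustrates that general idea only; the function name, the weights, and the likelihood values are all hypothetical and are not taken from the paper.

```python
import math


def fuse_log_likelihoods(stream_loglikes, weights):
    """Weighted sum of per-stream log-likelihoods, one score per class/state.

    stream_loglikes: dict mapping stream name -> list of log-likelihoods
    weights:         dict mapping stream name -> non-negative stream exponent
    (Both structures are illustrative assumptions, not the authors' design.)
    """
    names = list(stream_loglikes)
    n_classes = len(stream_loglikes[names[0]])
    if any(len(stream_loglikes[n]) != n_classes for n in names):
        raise ValueError("all streams must score the same number of classes")

    fused = [0.0] * n_classes
    for name in names:
        w = weights[name]
        for k, loglike in enumerate(stream_loglikes[name]):
            fused[k] += w * loglike
    return fused


# Hypothetical two-class example: audio favors class 1,
# while the visual and depth streams favor class 0.
loglikes = {
    "audio": [math.log(0.2), math.log(0.8)],
    "visual": [math.log(0.7), math.log(0.3)],
    "depth": [math.log(0.6), math.log(0.4)],
}

# Assumed weights: trust audio when it is clean, trust it less when noisy.
clean = fuse_log_likelihoods(loglikes, {"audio": 0.8, "visual": 0.1, "depth": 0.1})
noisy = fuse_log_likelihoods(loglikes, {"audio": 0.2, "visual": 0.4, "depth": 0.4})

best_clean = max(range(len(clean)), key=clean.__getitem__)
best_noisy = max(range(len(noisy)), key=noisy.__getitem__)
```

With these made-up numbers the decision follows the audio stream when its weight is high and shifts to the visual and depth streams when the audio weight is lowered, which is the intuition behind adjusting stream weights across audio conditions.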


Pages: 2714-2717

Number of pages: 4

