Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks

Cited by: 1
Authors
Saudi, Ali S. [1 ]
Khalil, Mahmoud I. [2 ]
Abbas, Hazem M. [2 ]
Affiliations
[1] German Univ Cairo, Fac Media Engn & Technol, New Cairo, Egypt
[2] Ain Shams Univ, Fac Engn, Cairo, Egypt
Keywords
Audio-Visual Speech Recognition; Bidirectional Recurrent Neural Network; Gabor filters; Discriminant Analysis
DOI
10.1007/978-3-030-20984-1_7
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The performance of speech recognition systems can be significantly improved when visual information is used in conjunction with audio information, especially in noisy environments. Prompted by the achievements of deep learning in solving Audio-Visual Speech Recognition (AVSR) problems, we propose a deep AVSR model based on a Long Short-Term Memory Bidirectional Recurrent Neural Network (LSTM-BRNN). The proposed model, termed the BRNNav model, applies Gabor filters in both the audio and visual front-ends under an Early Integration (EI) scheme. The Gabor features simulate the underlying spatiotemporal processing chain that occurs in the Primary Auditory Cortex (PAC) in conjunction with the Primary Visual Cortex (PVC); we refer to these features as Gabor Audio Features (GAF) and Gabor Visual Features (GVF), respectively. The experimental results show that the deep Gabor LSTM-BRNN-based model achieves superior performance compared to GMM-HMM-based models that use the same front-ends. Furthermore, the use of GAF and GVF in the audio and visual front-ends yields a significant improvement in performance over traditional audio and visual features.
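As a concrete illustration of the Early Integration scheme described in the abstract, the following is a minimal sketch (in PyTorch, not the authors' implementation) of how time-aligned Gabor Audio Features (GAF) and Gabor Visual Features (GVF) could be concatenated frame by frame and fed to a bidirectional LSTM. The feature dimensions, hidden size, class count, and the EarlyIntegrationBLSTM name are illustrative assumptions; the paper's actual GAF/GVF extraction and training setup are not reproduced here.

```python
# Minimal sketch of early integration (EI) of Gabor audio/visual features
# followed by a bidirectional LSTM. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class EarlyIntegrationBLSTM(nn.Module):
    """Bidirectional LSTM over frame-wise concatenated audio-visual features."""

    def __init__(self, audio_dim=40, visual_dim=40, hidden_dim=128,
                 num_layers=2, num_classes=28):
        super().__init__()
        self.blstm = nn.LSTM(
            input_size=audio_dim + visual_dim,  # early integration: fused input
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Map forward+backward hidden states to per-frame class scores.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, gabor_audio, gabor_visual):
        # gabor_audio:  (batch, time, audio_dim)  -- spectro-temporal Gabor features (GAF)
        # gabor_visual: (batch, time, visual_dim) -- Gabor-filtered mouth-region features (GVF)
        fused = torch.cat([gabor_audio, gabor_visual], dim=-1)
        outputs, _ = self.blstm(fused)
        return self.classifier(outputs)  # per-frame logits


if __name__ == "__main__":
    model = EarlyIntegrationBLSTM()
    audio = torch.randn(4, 100, 40)   # 4 utterances, 100 frames, 40-dim GAF (assumed)
    visual = torch.randn(4, 100, 40)  # time-aligned 40-dim GVF (assumed)
    logits = model(audio, visual)
    print(logits.shape)  # torch.Size([4, 100, 28])
```

The key design point reflected here is that fusion happens at the feature level (a single concatenated input stream) rather than at the decision level, which is what distinguishes the EI scheme from late-integration alternatives.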
Pages: 71 - 83 (13 pages)