Audio-Visual Speech Recognition System Using Recurrent Neural Network

Cited: 0
Authors
Goh, Yeh-Huann [1 ]
Lau, Kai-Xian [1 ]
Lee, Yoon-Ket [1 ]
Affiliations
[1] Tunku Abdul Rahman Univ Coll, Fac Engn & Technol, Jalan Genting Kelang, Kuala Lumpur 53300, Malaysia
Keywords
Recurrent Neural Network; Speech Recognition System; Audio-visual Speech Recognition System; Features Integration Mechanism; Audio Features Extraction Mechanism;
DOI
10.1109/incit.2019.8912049
CLC Number
TP [automation technology, computer technology];
Subject Classification Code
0812 ;
Abstract
An audio-visual speech recognition (AVSR) system integrates audio and visual information to perform speech recognition. AVSR has many practical applications, especially in natural language processing systems such as speech-to-text conversion, automatic translation and sentiment analysis. For decades, researchers favoured the Hidden Markov Model (HMM) for building speech recognition systems because of its high recognition rate. However, an HMM requires an enormous training dataset to achieve sufficient linguistic coverage, and its recognition rate in noisy environments is unsatisfactory. To overcome these deficiencies, a Recurrent Neural Network (RNN) based AVSR is proposed. The proposed AVSR model consists of three components: 1) an audio features extraction mechanism, 2) a visual features extraction mechanism and 3) an audio and visual features integration mechanism. The features integration mechanism combines the output features from both the audio and visual extraction mechanisms to generate the final classification results. In this research, the audio features are modelled by Mel-Frequency Cepstral Coefficients (MFCC) and further processed by an RNN, whereas the visual features are modelled by Haar-Cascade Detection with OpenCV and likewise further processed by an RNN. Both sets of extracted features are then integrated by a multimodal RNN-based features-integration mechanism. The speech recognition rate and robustness of the proposed AVSR system were evaluated on speech in a clean environment and at Signal-to-Noise Ratio (SNR) levels ranging from -20 dB to 20 dB in 5 dB intervals. On average, the final speech recognition rate is 89% across the different SNR levels.
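The three-component architecture described in the abstract (per-modality feature sequences, per-modality RNNs, and a features-integration step feeding a classifier) can be sketched as a minimal late-fusion RNN in plain NumPy. This is an illustrative sketch, not the authors' implementation: the feature dimensions, hidden sizes, class count, weights and input sequences below are all hypothetical placeholders standing in for real MFCC frames and Haar-cascade lip-region features.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_last_hidden(x, w_in, w_h, b):
    """Run a plain tanh (Elman) RNN over a (T, d) sequence; return final hidden state."""
    h = np.zeros(w_h.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ w_in + h @ w_h + b)
    return h

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical dimensions: 13 MFCC coefficients per audio frame, a 32-dim
# visual descriptor per video frame, 16 hidden units per stream, 10 word classes.
d_audio, d_visual, hidden, n_classes = 13, 32, 16, 10
wa_in = rng.normal(size=(d_audio, hidden))
wa_h  = rng.normal(size=(hidden, hidden)) * 0.1
ba    = np.zeros(hidden)
wv_in = rng.normal(size=(d_visual, hidden))
wv_h  = rng.normal(size=(hidden, hidden)) * 0.1
bv    = np.zeros(hidden)
w_out = rng.normal(size=(2 * hidden, n_classes))

# Placeholder input for one utterance: 40 audio frames, 25 video frames
# (the two streams need not be frame-aligned under late fusion).
audio_seq  = rng.normal(size=(40, d_audio))
visual_seq = rng.normal(size=(25, d_visual))

h_audio  = rnn_last_hidden(audio_seq, wa_in, wa_h, ba)    # audio stream RNN
h_visual = rnn_last_hidden(visual_seq, wv_in, wv_h, bv)   # visual stream RNN
fused = np.concatenate([h_audio, h_visual])               # features-integration step
probs = softmax(fused @ w_out)                            # class posterior over words
print(probs.argmax())
```

The concatenation of final hidden states is one simple way to realise the "features integration mechanism"; the paper's multimodal RNN-based integration is a learned variant of this fusion step, with the weights trained rather than drawn at random.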
Pages: 38 - 43
Page count: 6
Related Papers
50 records total
  • [41] Recognition of Isolated Digit Using Random Forest for Audio-Visual Speech Recognition
    Prashant Borde
    Sadanand Kulkarni
    Bharti Gawali
    Pravin Yannawar
    [J]. Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, 2022, 92 : 103 - 110
  • [43] Using Twin-HMM-Based Audio-Visual Speech Enhancement as a Front-End for Robust Audio-Visual Speech Recognition
    Abdelaziz, Ahmed Hussen
    Zeiler, Steffen
    Kolossa, Dorothea
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 867 - 871
  • [44] Lip Tracking Method for the System of Audio-Visual Polish Speech Recognition
    Kubanek, Mariusz
    Bobulski, Janusz
    Adrjanowicz, Lukasz
    [J]. ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, PT I, 2012, 7267 : 535 - 542
  • [45] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    [J]. ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284
  • [46] A Neural Network Architecture for Children's Audio-Visual Emotion Recognition
    Matveev, Anton
    Matveev, Yuri
    Frolova, Olga
    Nikolaev, Aleksandr
    Lyakso, Elena
    [J]. MATHEMATICS, 2023, 11 (22)
  • [47] Audio-visual speech recognition using minimum classification error training
    Miyajima, C
    Tokuda, K
    Kitamura, T
    [J]. NEURAL NETWORKS FOR SIGNAL PROCESSING X, VOLS 1 AND 2, PROCEEDINGS, 2000, : 3 - 12
  • [48] Audio-Visual Action Recognition Using Transformer Fusion Network
    Kim, Jun-Hwa
    Won, Chee Sun
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (03):
  • [49] Speaker independent audio-visual continuous speech recognition
    Liang, LH
    Liu, XX
    Zhao, YB
    Pi, XB
    Nefian, AV
    [J]. IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I AND II, PROCEEDINGS, 2002, : A25 - A28
  • [50] Building a data corpus for audio-visual speech recognition
    Chitu, Alin G.
    Rothkrantz, Leon J. M.
    [J]. EUROMEDIA '2007, 2007, : 88 - 92