Audio-Visual Speech Recognition System Using Recurrent Neural Network

Cited by: 0
Authors
Goh, Yeh-Huann [1 ]
Lau, Kai-Xian [1 ]
Lee, Yoon-Ket [1 ]
Affiliations
[1] Tunku Abdul Rahman Univ Coll, Fac Engn & Technol, Jalan Genting Kelang, Kuala Lumpur 53300, Malaysia
Keywords
Recurrent Neural Network; Speech Recognition System; Audio-Visual Speech Recognition System; Features Integration Mechanism; Audio Features Extraction Mechanism
DOI
10.1109/incit.2019.8912049
Chinese Library Classification (CLC): TP [Automation technology, computer technology];
Discipline Classification Code: 0812;
Abstract
An audio-visual speech recognition (AVSR) system integrates audio and visual information to perform speech recognition. AVSR has many practical applications, particularly in natural language processing systems such as speech-to-text conversion, automatic translation and sentiment analysis. For decades, researchers favoured the Hidden Markov Model (HMM) for building speech recognition systems because of its good recognition accuracy. However, an HMM requires an enormous training dataset to achieve sufficient linguistic coverage, and its recognition rate in noisy environments is unsatisfactory. To overcome these deficiencies, a Recurrent Neural Network (RNN) based AVSR is proposed. The proposed AVSR model consists of three components: 1) an audio feature extraction mechanism, 2) a visual feature extraction mechanism and 3) an audio-visual feature integration mechanism. The feature integration mechanism combines the output features from both the audio and visual extraction mechanisms to generate the final classification results. In this research, the audio features are modelled with Mel-Frequency Cepstral Coefficients (MFCC) and further processed by an RNN, whereas the visual features are obtained with Haar-cascade detection in OpenCV and likewise processed by an RNN. Both sets of extracted features are then integrated by a multimodal RNN-based feature-integration mechanism. The performance of the proposed AVSR system, in terms of recognition rate and robustness, was evaluated on speech under a clean environment and at Signal-to-Noise Ratio (SNR) levels ranging from -20 dB to 20 dB in 5 dB steps. On average, the final speech recognition rate is 89% across the different SNR levels.
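A rough, self-contained sketch of the pipeline described in the abstract is given below. It assumes librosa for MFCC extraction, OpenCV's pretrained frontal-face Haar cascade (with the mouth region approximated as the lower half of the detected face), and PyTorch GRUs for the per-modality RNNs with simple late fusion; the hyper-parameters (13 MFCCs, 32x32 mouth crops, 64 hidden units) and the noise-mixing helper used for the SNR sweep are illustrative assumptions, not details taken from the paper.

import cv2
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_mfcc(wav_path, n_mfcc=13):
    """Audio branch: return an (n_frames, n_mfcc) MFCC sequence for one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # time-major, ready for an RNN

def add_noise_at_snr(signal, noise, snr_db):
    """Mix noise into a clean signal at a target SNR (dB), as in the -20 dB to +20 dB sweep."""
    noise = noise[: len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise

def extract_mouth_rois(video_path, size=(32, 32)):
    """Visual branch: crop a rough mouth region (lower half of the detected face) per frame."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    rois = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue
        x, y, w, h = faces[0]
        mouth = gray[y + h // 2: y + h, x: x + w]   # lower half of the face box
        rois.append(cv2.resize(mouth, size).flatten() / 255.0)
    cap.release()
    return np.array(rois, dtype=np.float32)

class AVFusionRNN(nn.Module):
    """One GRU per modality; final hidden states are concatenated and classified."""
    def __init__(self, audio_dim=13, visual_dim=32 * 32, hidden=64, n_classes=10):
        super().__init__()
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio_seq, visual_seq):
        _, h_a = self.audio_rnn(audio_seq)      # h_a: (1, batch, hidden)
        _, h_v = self.visual_rnn(visual_seq)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)
        return self.classifier(fused)

The fusion shown here is a plain late-fusion baseline (concatenated final hidden states followed by a linear classifier); the abstract does not specify how the paper's multimodal RNN-based integration mechanism is structured, so this step in particular should be read as one plausible realisation rather than the authors' exact design.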
Pages: 38-43 (6 pages)