Audio-Visual Speech Recognition System Using Recurrent Neural Network

Authors
Goh, Yeh-Huann [1]
Lau, Kai-Xian [1]
Lee, Yoon-Ket [1]
Affiliations
[1] Tunku Abdul Rahman Univ Coll, Fac Engn & Technol, Jalan Genting Kelang, Kuala Lumpur 53300, Malaysia
Keywords
Recurrent Neural Network; Speech Recognition System; Audio-visual Speech Recognition System; Features Integration Mechanism; Audio Features Extraction Mechanism;
DOI
10.1109/incit.2019.8912049
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
An audio-visual speech recognition (AVSR) system integrates audio and visual information to perform speech recognition. AVSR has many practical applications, especially in natural language processing systems such as speech-to-text conversion, automatic translation and sentiment analysis. Decades ago, researchers tended to use Hidden Markov Models (HMMs) to build speech recognition systems because of their good recognition rates. However, an HMM needs an enormous training dataset to achieve sufficient linguistic coverage, and its recognition rate in noisy environments is unsatisfactory. To overcome these deficiencies, a Recurrent Neural Network (RNN) based AVSR system is proposed. The proposed model consists of three components: 1) an audio feature extraction mechanism, 2) a visual feature extraction mechanism and 3) an audio-visual feature integration mechanism. The integration mechanism combines the output features of the audio and visual extraction mechanisms to generate the final classification results. In this research, the audio feature mechanism is modelled with Mel-frequency cepstral coefficients (MFCC), whose output is further processed by an RNN, whereas the visual feature mechanism is modelled with Haar-cascade detection in OpenCV, again followed by an RNN. The two sets of extracted features are then integrated by a multimodal RNN-based feature integration mechanism. The speech recognition rate and the robustness of the proposed AVSR system were evaluated using clean speech and speech at signal-to-noise ratio (SNR) levels ranging from -20 dB to 20 dB in 5 dB steps. On average, the final speech recognition rate is 89% across the different SNR levels.
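The evaluation protocol in the abstract (SNR levels from -20 dB to 20 dB in 5 dB steps) can be illustrated with a small sketch of how a noisy test signal at a chosen SNR might be produced. The abstract does not state the noise type or the mixing procedure the authors used, so everything below is an assumption: the helper names `signal_power`, `mix_at_snr` and `measured_snr_db` are hypothetical, and the approach shown is the common one of scaling the noise so that the speech-to-noise power ratio matches the target.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the resulting SNR equals snr_db.

    Hypothetical helper: the paper's actual mixing procedure is not given.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = signal_power(speech)
    p_noise = signal_power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of a mixture, given the clean speech."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(signal_power(speech) / signal_power(residual))


# The SNR grid described in the abstract: -20 dB to 20 dB in 5 dB steps.
SNR_LEVELS_DB = list(range(-20, 21, 5))

if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for real data: a 440 Hz tone as "speech", Gaussian noise.
    speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
    for snr in SNR_LEVELS_DB:
        noisy = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noisy), 3))
```

In a real experiment the tone and Gaussian noise would be replaced by recorded utterances and whatever noise source the study used; the scaling step itself stays the same.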
Pages: 38-43
Number of pages: 6