Combining audio and visual speech recognition using LSTM and deep convolutional neural network

被引:12
|
作者
Shashidhar R. [1 ]
Patilkulkarni S. [1 ]
Puneeth S.B. [2 ]
机构
[1] Department of Electronics and Communication Engineering, JSS Science and Technology University, Mysore
[2] Departments of Electronics and Communication Engineering, Presidency University, Bangalore
关键词
Audio-visual speech recognition; Custom Dataset; DNN; Lip-reading; LSTM;
D O I
10.1007/s41870-022-00907-y
中图分类号
学科分类号
摘要
Human speech is bimodal, whereas audio speech relates to the speaker's acoustic waveform. Lip motions are referred to as visual speech. Audiovisual Speech Recognition is one of the emerging fields of research, particularly when audio is corrupted by noise. In the proposed AVSR system, a custom dataset was designed for English Language. Mel Frequency Cepstral Coefficients technique was used for audio processing and the Long Short-Term Memory (LSTM) method for visual speech recognition. Finally, integrate the audio and visual into a single platform using a deep neural network. From the result, it was evident that the accuracy was 90% for audio speech recognition, 71% for visual speech recognition, and 91% for audiovisual speech recognition, the result was better than the existing approaches. Ultimately model was skilled at enchanting many suitable decisions while forecasting the spoken word for the dataset that was used. © 2022, The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management.
引用
收藏
页码:3425 / 3436
页数:11
相关论文
共 50 条
  • [1] Dysarthric Speech Recognition Using Convolutional LSTM Neural Network
    Kim, Myungjong
    Cao, Beiming
    An, Kwanghoon
    Wang, Jun
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2948 - 2952
  • [2] Audio-Visual (Multimodal) Speech Recognition System Using Deep Neural Network
    Paulin, Hebsibah
    Milton, R. S.
    JanakiRaman, S.
    Chandraprabha, K.
    [J]. JOURNAL OF TESTING AND EVALUATION, 2019, 47 (06) : 3963 - 3974
  • [3] Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    [J]. ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 281 - 284
  • [4] Audio Visual Speech Recognition Using Deep Recurrent Neural Networks
    Thanda, Abhinav
    Venkatesan, Shankar M.
    [J]. MULTIMODAL PATTERN RECOGNITION OF SOCIAL SIGNALS IN HUMAN-COMPUTER-INTERACTION, MPRSS 2016, 2017, 10183 : 98 - 109
  • [5] Audio-visual speech recognition using lstm and cnn
    El Maghraby E.E.
    Gody A.M.
    Farouk M.H.
    [J]. Recent Advances in Computer Science and Communications, 2021, 14 (06) : 2023 - 2039
  • [6] Visual Speech Recognition of Korean Words Using Convolutional Neural Network
    Lee, Sung-Won
    Yu, Je-Hun
    Park, Seung Min
    Sim, Kwee-Bo
    [J]. INTERNATIONAL JOURNAL OF FUZZY LOGIC AND INTELLIGENT SYSTEMS, 2019, 19 (01) : 1 - 9
  • [7] Maxout neurons for deep convolutional and LSTM neural networks in speech recognition
    Cai, Meng
    Liu, Jia
    [J]. SPEECH COMMUNICATION, 2016, 77 : 53 - 64
  • [8] Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
    Hou, Jen-Cheng
    Wang, Syu-Siang
    Lai, Ying-Hui
    Tsao, Yu
    Chang, Hsiu-Wen
    Wang, Hsin-Min
    [J]. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2018, 2 (02): : 117 - 128
  • [9] Deep Convolutional Neural Network for Arabic Speech Recognition
    Amari, Rafik
    Noubigh, Zouhaira
    Zrigui, Salah
    Berchech, Dhaou
    Nicolas, Henri
    Zrigui, Mounir
    [J]. COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2022, 2022, 13501 : 120 - 134
  • [10] Audio-Visual Speech Recognition System Using Recurrent Neural Network
    Goh, Yeh-Huann
    Lau, Kai-Xian
    Lee, Yoon-Ket
    [J]. PROCEEDINGS OF THE 2019 4TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY (INCIT): ENCOMPASSING INTELLIGENT TECHNOLOGY AND INNOVATION TOWARDS THE NEW ERA OF HUMAN LIFE, 2019, : 38 - 43