Combining audio and visual speech recognition using LSTM and deep convolutional neural network

Cited by: 12
Authors
Shashidhar R. [1]
Patilkulkarni S. [1]
Puneeth S.B. [2]
Affiliations
[1] Department of Electronics and Communication Engineering, JSS Science and Technology University, Mysore
[2] Department of Electronics and Communication Engineering, Presidency University, Bangalore
Keywords
Audio-visual speech recognition; Custom dataset; DNN; Lip-reading; LSTM
DOI
10.1007/s41870-022-00907-y
Abstract
Human speech is bimodal: audio speech is the speaker's acoustic waveform, and visual speech is the motion of the lips. Audio-visual speech recognition (AVSR) is an emerging field of research, particularly for cases where the audio is corrupted by noise. For the proposed AVSR system, a custom dataset was designed for the English language. Mel-frequency cepstral coefficients (MFCC) were used for audio processing, and a Long Short-Term Memory (LSTM) network was used for visual speech recognition. Finally, the audio and visual streams were integrated into a single platform using a deep neural network. The accuracy was 90% for audio speech recognition, 71% for visual speech recognition, and 91% for audio-visual speech recognition, a result better than existing approaches. Ultimately, the model was able to make many suitable decisions when predicting the spoken word for the dataset that was used. © 2022, The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management.
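The abstract does not specify how the fusion network is built. As a rough illustration only, the sketch below shows one common way to combine two modalities: concatenate an audio feature vector and a visual feature vector, then pass the result through a small fully connected network with a softmax output over candidate words. All dimensions, the random (untrained) weights, and the function names are assumptions made for this example; this is not the authors' architecture.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse_and_classify(audio_feat, visual_feat, params):
    """Concatenate both modality vectors, then hidden layer + softmax."""
    fused = list(audio_feat) + list(visual_feat)
    (w1, b1), (w2, b2) = params
    hidden = relu(dense(fused, w1, b1))
    return softmax(dense(hidden, w2, b2))

rng = random.Random(0)
# Hypothetical sizes: none of these values come from the paper.
AUDIO_DIM, VISUAL_DIM, HIDDEN, N_WORDS = 13, 8, 16, 5
params = (init_layer(HIDDEN, AUDIO_DIM + VISUAL_DIM, rng),
          init_layer(N_WORDS, HIDDEN, rng))

# Random stand-ins for MFCC-derived audio features and for the final
# state of a visual LSTM; real features would come from those front ends.
audio_feat = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
visual_feat = [rng.gauss(0, 1) for _ in range(VISUAL_DIM)]

probs = fuse_and_classify(audio_feat, visual_feat, params)
predicted_word = max(range(N_WORDS), key=probs.__getitem__)
```

Here `probs` is a probability distribution over the hypothetical word classes and `predicted_word` is the index of the most likely one. With trained weights, a noisy audio vector could be partly compensated by the visual vector, which is the motivation the abstract gives for combining the two streams.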
Pages: 3425-3436
Number of pages: 11