Combining audio and visual speech recognition using LSTM and deep convolutional neural network

Cited by: 12

Authors:

Shashidhar R. [1]

Patilkulkarni S. [1]

Puneeth S.B. [2]

Affiliations:

[1] Department of Electronics and Communication Engineering, JSS Science and Technology University, Mysore

[2] Department of Electronics and Communication Engineering, Presidency University, Bangalore

Source:

International Journal of Information Technology | 2022 / Vol. 14 / Issue 7

Keywords:

Audio-visual speech recognition; Custom Dataset; DNN; Lip-reading; LSTM;

DOI:

10.1007/s41870-022-00907-y

Abstract:

Human speech is bimodal: audio speech is the speaker's acoustic waveform, and visual speech is the accompanying lip motion. Audiovisual Speech Recognition (AVSR) is an emerging field of research, particularly for cases where the audio is corrupted by noise. For the proposed AVSR system, a custom dataset was designed for the English language. The Mel Frequency Cepstral Coefficients technique was used for audio processing, and the Long Short-Term Memory (LSTM) method for visual speech recognition. Finally, the audio and visual streams were integrated into a single platform using a deep neural network. The reported accuracy was 90% for audio speech recognition, 71% for visual speech recognition, and 91% for audiovisual speech recognition, which the authors state is better than existing approaches. Overall, the model made many suitable decisions when predicting the spoken word for the dataset that was used. © 2022, The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management.

Pages: 3425-3436

Page count: 11

