Usefulness of glottal excitation source information for audio-visual speech recognition system

Cited by: 0

Authors:

Nandakishor S. [1]

Pati D. [1]

Affiliation:

[1] Department of ECE, NIT Nagaland, Chumukedima, Nagaland, Dimapur

Source:

International Journal of Speech Technology | 2023 / Vol. 26 / Issue 04

Keywords:

DNN-HMM sMBR; GFD; GMFCC; IAIF; MFCC

DOI:

10.1007/s10772-023-10060-x

Abstract:

In this work, excitation-source-based glottal information is explored as supplementary evidence for developing a robust audio-visual speech recognition system. The commonly used audio feature, the mel-frequency cepstral coefficient (MFCC), captures vocal-tract information but not excitation source information. We use the glottal information together with MFCC and a visual feature (lip movements). The Iterative Adaptive Inverse Filtering (IAIF) method is used to estimate the glottal flow derivative (GFD), and the standard mel-frequency cepstral processing approach is applied to obtain the glottal mel-frequency cepstral coefficient (GMFCC). A DNN-HMM with state-level minimum Bayes risk training (DNN-HMM sMBR) is used to build the audio-visual speech recognition model. In our experimental analysis, we observe that some English alphabet letters, such as ‘P’ and ‘B’, are confused by the machine when only MFCC, or a combination of MFCC and lip movement features, is used. This may be due to the similar vocal-tract activity or the similar lip movements of the sound units ‘p’ and ‘b’. The letters ‘P’ and ‘B’ are distinguished when the glottal excitation information is included together with the vocal-tract and visual features. The conventional audio-visual feature set (MFCC and lip movement information) yields a recognition performance of 82.76%, whereas including GMFCC information increases it to 84.77%. These experimental observations reflect the usefulness of excitation source information for developing a robust audio-visual speech recognition system. We also observe that the glottal excitation source information is robust to additive noise and is effective for audio-visual speech recognition. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
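The feature pipeline described in the abstract (inverse filtering to estimate an excitation signal, then ordinary mel-cepstral processing of that signal) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: a single-pass LPC inverse filter stands in for IAIF (it is not IAIF), the frame sizes, filter counts, and function names (`lpc_residual`, `mel_cepstrum`) are hypothetical, the input is a synthetic tone mixture rather than speech, and frame-wise concatenation is only one possible way of combining the streams.

```python
import numpy as np

def lpc_residual(x, order=12):
    """Single-pass LPC inverse filter: a crude stand-in for IAIF, NOT IAIF itself."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Normal equations R a = r[1:] give the linear-prediction coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # small ridge for numerical stability
    a = np.linalg.solve(R, r[1:])
    # Residual e[n] = x[n] - sum_k a[k] * x[n - k]
    inverse_filter = np.concatenate(([1.0], -a))
    return np.convolve(x, inverse_filter)[: len(x)]

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        lo, mid, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_cepstrum(x, sr, frame_len=200, hop=80, n_fft=256, n_mels=20, n_ceps=13):
    """Frame, window, power spectrum, mel filterbank, log, DCT-II."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

# Synthetic one-second test signal (an assumption; not speech data)
rng = np.random.default_rng(0)
sr = 8000
t = np.arange(sr) / sr
signal = (np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
          + 0.01 * rng.standard_normal(sr))

mfcc = mel_cepstrum(signal, sr)                 # cepstra of the waveform itself
gmfcc = mel_cepstrum(lpc_residual(signal), sr)  # same processing on the residual
features = np.hstack([mfcc, gmfcc])             # assumed frame-wise concatenation
```

Both streams pass through the same `mel_cepstrum` function, which mirrors the abstract's statement that GMFCC is obtained by standard mel-cepstral processing of the estimated glottal signal. A lip-movement feature stream is omitted here; in this sketch it would be a third block of columns with the same number of frames.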

Pages: 933-945

Number of pages: 12
