Usefulness of glottal excitation source information for audio-visual speech recognition system

Cited: 0
Authors
Nandakishor S. [1 ]
Pati D. [1 ]
Affiliation
[1] Department of ECE, NIT Nagaland, Chumukedima, Dimapur, Nagaland
Keywords
DNN-HMM sMBR; GFD; GMFCC; IAIF; MFCC;
DOI
10.1007/s10772-023-10060-x
Abstract
In this work, glottal excitation source information is explored as supplementary evidence for developing a robust audio-visual speech recognition system. The commonly used audio feature, the mel-frequency cepstral coefficient (MFCC), captures vocal-tract information but not excitation source information. We therefore use glottal information together with MFCC and visual features (lip movements) for our objective. The Iterative Adaptive Inverse Filtering (IAIF) method is used to estimate the glottal flow derivative (GFD), and the standard mel-frequency cepstral processing approach is applied to it to obtain glottal mel-frequency cepstral coefficients (GMFCC). A DNN-HMM trained with state-level minimum Bayes risk (DNN-HMM sMBR) is used to build the audio-visual speech recognition model. In our experimental analysis, we observe that some English letters, such as 'P' and 'B', are confused by the machine when only MFCC, or the combination of MFCC and lip-movement features, is used. This may be due to the similar vocal-tract activities or the similar lip movements of the sound units /p/ and /b/. The letters 'P' and 'B' are distinguished when the glottal excitation information is included together with the vocal-tract and visual features. The conventional audio-visual features, MFCC and lip-movement information, provide 82.76% accuracy, whereas including GMFCC raises performance to 84.77%. These experimental observations reflect the usefulness of excitation source information for the development of a robust audio-visual speech recognition system. We also observed that the glottal excitation source information is robust to additive noise and effective for audio-visual speech recognition. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
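The GMFCC pipeline described in the abstract can be sketched as follows. This is a minimal, simplified illustration, not the paper's implementation: full IAIF alternates glottal and vocal-tract LPC estimation over two passes with lip-radiation cancellation, whereas this sketch uses a single fixed pre-emphasis stage before vocal-tract LPC. All function names, filter orders, and frame parameters here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz
from scipy.fft import dct

def lpc(x, order):
    # Autocorrelation-method LPC; returns inverse-filter coeffs [1, -a1, ..., -ap].
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))

def glottal_flow_derivative(frame, vt_order=12, pre_coef=0.99):
    # Simplified single-pass IAIF-style inverse filtering (illustrative, not full IAIF):
    # 1) fixed first-order pre-emphasis approximates removal of the gross glottal tilt,
    # 2) vocal tract estimated by LPC on the pre-emphasized frame,
    # 3) inverse filtering the frame by A_vt(z) yields a GFD estimate.
    pre = lfilter([1.0, -pre_coef], [1.0], frame)
    a_vt = lpc(pre, vt_order)
    return lfilter(a_vt, [1.0], frame)

def mel_filterbank(n_filt, n_fft, sr):
    # Standard triangular mel filterbank over the rFFT bins.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(0.0, mel(sr / 2.0), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def gmfcc(frame, sr=16000, n_fft=512, n_filt=26, n_ceps=13):
    # Mel-cepstral processing applied to the GFD estimate instead of the raw signal.
    gfd = glottal_flow_derivative(frame * np.hamming(len(frame)))
    power = np.abs(np.fft.rfft(gfd, n_fft)) ** 2
    energies = mel_filterbank(n_filt, n_fft, sr) @ power
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_ceps]
```

Applying `gmfcc` to a 25 ms frame (400 samples at 16 kHz) yields a 13-dimensional glottal cepstral vector that can be concatenated with conventional MFCC and visual features for recognition.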
Pages: 933-945
Page count: 12
Related papers
50 items total
  • [21] Bimodal fusion in audio-visual speech recognition
    Zhang, XZ
    Mersereau, RM
    Clements, M
    [J]. 2002 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL I, PROCEEDINGS, 2002, : 964 - 967
  • [22] Audio-Visual Speech Recognition System Using Recurrent Neural Network
    Goh, Yeh-Huann
    Lau, Kai-Xian
    Lee, Yoon-Ket
    [J]. PROCEEDINGS OF THE 2019 4TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY (INCIT): ENCOMPASSING INTELLIGENT TECHNOLOGY AND INNOVATION TOWARDS THE NEW ERA OF HUMAN LIFE, 2019, : 38 - 43
  • [23] Robot Command Interface Using an Audio-Visual Speech Recognition System
    Ceballos, Alexander
    Gomez, Juan
    Prieto, Flavio
    Redarce, Tanneguy
    [J]. PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, PROCEEDINGS, 2009, 5856 : 869 - +
  • [24] Lip Tracking Method for the System of Audio-Visual Polish Speech Recognition
    Kubanek, Mariusz
    Bobulski, Janusz
    Adrjanowicz, Lukasz
    [J]. ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, PT I, 2012, 7267 : 535 - 542
  • [25] AUDIO-VISUAL SPEECH RECOGNITION INCORPORATING FACIAL DEPTH INFORMATION CAPTURED BY THE KINECT
    Galatas, Georgios
    Potamianos, Gerasimos
    Makedon, Fillia
    [J]. 2012 PROCEEDINGS OF THE 20TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2012, : 2714 - 2717
  • [26] FUSING INFORMATION STREAMS IN END-TO-END AUDIO-VISUAL SPEECH RECOGNITION
    Yu, Wentao
    Zeiler, Steffen
    Kolossa, Dorothea
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3430 - 3434
  • [27] Audio-Visual Speech Recognition Using LSTM and CNN
    El Maghraby, Eslam E.
    Gody, Amr M.
    Farouk, M. Hesham
    [J]. Recent Advances in Computer Science and Communications, 2021, 14 (06) : 2023 - 2039
  • [28] Speaker independent audio-visual continuous speech recognition
    Liang, LH
    Liu, XX
    Zhao, YB
    Pi, XB
    Nefian, AV
    [J]. IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I AND II, PROCEEDINGS, 2002, : A25 - A28
  • [29] Building a data corpus for audio-visual speech recognition
    Chitu, Alin G.
    Rothkrantz, Leon J. M.
    [J]. EUROMEDIA '2007, 2007, : 88 - 92
  • [30] Audio-visual fuzzy fusion for robust speech recognition
    Malcangi, M.
    Ouazzane, K.
    Patel, P.
    [J]. 2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2013,