Usefulness of glottal excitation source information for audio-visual speech recognition system

Authors
Nandakishor S. [1]
Pati D. [1]
Affiliation
[1] Department of ECE, NIT Nagaland, Chumukedima, Nagaland, Dimapur
Keywords
DNN-HMM sMBR; GFD; GMFCC; IAIF; MFCC
DOI
10.1007/s10772-023-10060-x
Abstract
In this work, excitation-source-based glottal information is explored as supplementary evidence for developing a robust audio-visual speech recognition system. The commonly used audio feature, the mel-frequency cepstral coefficient (MFCC), captures vocal-tract information but not excitation source information. We use the glottal information together with MFCC and a visual feature (lip movements). The Iterative Adaptive Inverse Filtering (IAIF) method is used to estimate the glottal flow derivative (GFD), and the standard mel-frequency cepstral processing approach is applied to the GFD to obtain glottal mel-frequency cepstral coefficients (GMFCC). DNN-HMM State-Level Minimum Bayes Risk (DNN-HMM sMBR) training is used to build the audio-visual speech recognition model. In our experimental analysis, we observe that some English letters, such as ‘P’ and ‘B’, are confused by the machine when only MFCC, or a combination of MFCC and lip movement features, is used. This may be due to the similar vocal-tract activity or the similar lip movements of the sound units ‘p’ and ‘b’. The letters ‘P’ and ‘B’ are distinguished when the glottal excitation information is included together with the vocal-tract and visual features. The conventional audio-visual feature set (MFCC and lip movement information) yields a performance of 82.76%, whereas the inclusion of GMFCC information increases the performance to 84.77%. These experimental observations reflect the usefulness of excitation source information for the development of a robust audio-visual speech recognition system. We also observe that the glottal excitation source information is robust to additive noise and effective for audio-visual speech recognition. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
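The idea behind inverse filtering mentioned in the abstract can be illustrated with a minimal sketch. This is not IAIF and not the authors' implementation: it is a single-pass linear-prediction (LPC) inverse filter applied to a synthetic signal, shown only to make concrete how removing an estimated vocal-tract filter exposes the excitation. IAIF refines this basic step iteratively and also accounts for the glottal and lip-radiation contributions. All names and parameter values below (the AR(2) "vocal tract", the 80-sample pitch period, the function names) are assumptions chosen for illustration.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.
    Returns a with a[0] = 1, i.e. A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # updated prediction error
    return a

def inverse_filter(x, a):
    """Apply the FIR filter A(z): e[n] = sum_k a[k] * x[n - k]."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

# Synthetic "speech" (assumption): an impulse train (the excitation)
# passed through a stable all-pole filter 1 / (1 - 1.6 z^-1 + 0.9 z^-2).
period, length = 80, 800
excitation = [1.0 if n % period == 0 else 0.0 for n in range(length)]
signal = []
for n in range(length):
    y1 = signal[n - 1] if n >= 1 else 0.0
    y2 = signal[n - 2] if n >= 2 else 0.0
    signal.append(excitation[n] + 1.6 * y1 - 0.9 * y2)

# Estimate the all-pole filter from the signal alone, then undo it.
coeffs = levinson_durbin(autocorrelation(signal, 2), 2)
residual = inverse_filter(signal, coeffs)
```

On this toy signal the estimated coefficients should land close to the true filter (1, -1.6, 0.9), and the residual should be close to the original impulse train. Real speech is far less cooperative, which is one reason methods such as IAIF exist; GMFCC-style features would then come from running ordinary mel-cepstral analysis on the estimated source signal rather than on the speech waveform.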
Pages: 933-945
Number of pages: 12