An audio-visual corpus for multimodal automatic speech recognition

Cited by: 57
Authors
Czyzewski, Andrzej [1]
Kostek, Bozena [2]
Bratoszewski, Piotr [1]
Kotus, Jozef [1]
Szykulski, Marcin [1]
Affiliations
[1] Gdansk Univ Technol, Fac Elect Telecommun & Informat, Multimedia Syst Dept, Ul Narutowicza 11-12, PL-80233 Gdansk, Poland
[2] Gdansk Univ Technol, Fac Elect Telecommun & Informat, Audio Acoust Lab, Ul Narutowicza 11-12, PL-80233 Gdansk, Poland
Keywords
MODALITY corpus; English language corpus; Speech recognition; AVSR; DATABASE; HEARING;
DOI
10.1007/s10844-016-0438-z
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A review of available audio-visual speech corpora and a description of a new multimodal corpus of English speech recordings are provided. The new corpus, containing 31 hours of recordings, was created specifically to assist the development of audio-visual speech recognition (AVSR) systems. The database related to the corpus includes high-resolution, high-framerate stereoscopic video streams from RGB cameras and a depth imaging stream from a Time-of-Flight camera, accompanied by audio recorded using both a microphone array and a microphone built into a mobile computer. To support AVSR system training, every utterance was manually labeled, and the resulting label files were added to the corpus repository. Because it includes recordings made in noisy conditions, the corpus can also be used to test the robustness of speech recognition systems in the presence of acoustic background noise. The process of building the corpus, including the recording, labeling, and post-processing phases, is described in the paper. Results achieved with the developed audio-visual automatic speech recognition (ASR) engine, trained and tested with the material contained in the corpus, are presented and discussed together with comparative test results employing a state-of-the-art commercial ASR engine. To demonstrate its practical use, the corpus is made publicly available.
Pages: 167-192
Number of pages: 26