An audio-visual corpus for multimodal automatic speech recognition

Cited by: 57
Authors
Czyzewski, Andrzej [1]
Kostek, Bozena [2]
Bratoszewski, Piotr [1]
Kotus, Jozef [1]
Szykulski, Marcin [1]
Affiliations
[1] Gdansk Univ Technol, Fac Elect Telecommun & Informat, Multimedia Syst Dept, Ul Narutowicza 11-12, PL-80233 Gdansk, Poland
[2] Gdansk Univ Technol, Fac Elect Telecommun & Informat, Audio Acoust Lab, Ul Narutowicza 11-12, PL-80233 Gdansk, Poland
Keywords
MODALITY corpus; English language corpus; Speech recognition; AVSR; DATABASE; HEARING;
DOI
10.1007/s10844-016-0438-z
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A review of available audio-visual speech corpora and a description of a new multimodal corpus of English speech recordings are provided. The new corpus, containing 31 hours of recordings, was created specifically to assist the development of audio-visual speech recognition (AVSR) systems. The database related to the corpus includes high-resolution, high-framerate stereoscopic video streams from RGB cameras and a depth imaging stream from a Time-of-Flight camera, accompanied by audio recorded using both a microphone array and a microphone built into a mobile computer. For the purpose of training AVSR systems, every utterance was manually labeled, and the resulting label files were added to the corpus repository. Owing to the inclusion of recordings made in noisy conditions, the corpus can also be used for testing the robustness of speech recognition systems in the presence of acoustic background noise. The process of building the corpus, including the recording, labeling, and post-processing phases, is described in the paper. Results achieved with the developed audio-visual automatic speech recognition (ASR) engine, trained and tested with the material contained in the corpus, are presented and discussed together with comparative test results employing a state-of-the-art/commercial ASR engine. In order to demonstrate the practical use of the corpus, it is made available for public use.
Pages: 167-192
Page count: 26
Related papers
50 in total
  • [41] Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
    Pan, Xichen
    Chen, Peiyu
    Gong, Yichen
    Zhou, Helong
    Wang, Xinbing
    Lin, Zhouhan
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4491 - 4503
  • [42] Multimodal English corpus for automatic speech recognition
    Kunka, Bartosz
    Kupryjanow, Adam
    Dalka, Piotr
    Bratoszewski, Piotr
    Szczodrak, Maciej
    Spaleniak, Pawel
    Szykulski, Marcin
    Czyzewski, Andrzej
    [J]. 2013 SIGNAL PROCESSING: ALGORITHMS, ARCHITECTURES, ARRANGEMENTS, AND APPLICATIONS (SPA), 2013, : 106 - 111
  • [43] A corpus of audio-visual Lombard speech with frontal and profile views
    Alghamdi, Najwa
    Maddock, Steve
    Marxer, Ricard
    Barker, Jon
    Brown, Guy J.
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2018, 143 (06): : EL523 - EL529
  • [44] TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech
    Harte, Naomi
    Gillen, Eoin
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2015, 17 (05) : 603 - 615
  • [45] Audio-visual feature fusion via deep neural networks for automatic speech recognition
    Rahmani, Mohammad Hasan
    Almasganj, Farshad
    Seyyedsalehi, Seyyed Ali
    [J]. DIGITAL SIGNAL PROCESSING, 2018, 82 : 54 - 63
  • [46] RETRACTED: Audio-Visual Automatic Speech Recognition Towards Education for Disabilities (Retracted Article)
    Debnath, Saswati
    Roy, Pinki
    Namasudra, Suyel
    Crespo, Ruben Gonzalez
    [J]. JOURNAL OF AUTISM AND DEVELOPMENTAL DISORDERS, 2023, 53 (09) : 3581 - 3594
  • [47] Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition
    Chen, Hang
    Wang, Qing
    Du, Jun
    Yin, Bao-Cai
    Pan, Jia
    Lee, Chin-Hui
    [J]. IEEE/ACM Transactions on Audio Speech and Language Processing, 2024, 32 : 2508 - 2521
  • [48] A Phone-Viseme Dynamic Bayesian Network for Audio-Visual Automatic Speech Recognition
    Terry, Louis
    Katsaggelos, Aggelos K.
    [J]. 19TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1-6, 2008, : 2597 - 2600
  • [49] Speaker independent audio-visual continuous speech recognition
    Liang, LH
    Liu, XX
    Zhao, YB
    Pi, XB
    Nefian, AV
    [J]. IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I AND II, PROCEEDINGS, 2002, : A25 - A28
  • [50] Audio-visual fuzzy fusion for robust speech recognition
    Malcangi, M.
    Ouazzane, K.
    Patel, P.
    [J]. 2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2013,