An audio-visual corpus for multimodal automatic speech recognition

Cited by: 57

Authors:

Czyzewski, Andrzej [1]

Kostek, Bozena [2]

Bratoszewski, Piotr [1]

Kotus, Jozef [1]

Szykulski, Marcin [1]

Affiliations:

[1] Gdansk Univ Technol, Fac Elect Telecommun & Informat, Multimedia Syst Dept, Ul Narutowicza 11-12, PL-80233 Gdansk, Poland

[2] Gdansk Univ Technol, Fac Elect Telecommun & Informat, Audio Acoust Lab, Ul Narutowicza 11-12, PL-80233 Gdansk, Poland

Source:

JOURNAL OF INTELLIGENT INFORMATION SYSTEMS | 2017 / Volume 49 / Issue 02

Keywords:

MODALITY corpus; English language corpus; Speech recognition; AVSR; DATABASE; HEARING;

DOI:

10.1007/s10844-016-0438-z

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper reviews available audio-visual speech corpora and describes a new multimodal corpus of English speech recordings. The new corpus, containing 31 hours of recordings, was created specifically to assist the development of audio-visual speech recognition (AVSR) systems. The database related to the corpus includes high-resolution, high-framerate stereoscopic video streams from RGB cameras and a depth imaging stream from a Time-of-Flight camera, accompanied by audio recorded using both a microphone array and a microphone built into a mobile computer. To support the training of AVSR systems, every utterance was manually labeled, and the resulting label files were added to the corpus repository. Because it includes recordings made in noisy conditions, the corpus can also be used to test the robustness of speech recognition systems in the presence of acoustic background noise. The paper describes the process of building the corpus, including the recording, labeling, and post-processing phases. Results achieved with the developed audio-visual automatic speech recognition (ASR) engine, trained and tested with the material contained in the corpus, are presented and discussed together with comparative test results obtained with a state-of-the-art/commercial ASR engine. To demonstrate the practical use of the corpus, it is made available for public use.


Pages: 167 - 192

Page count: 26

50 entries in total

[41] Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
Pan, Xichen
Chen, Peiyu
Gong, Yichen
Zhou, Helong
Wang, Xinbing
Lin, Zhouhan
[J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4491 - 4503
[42] Multimodal English corpus for automatic speech recognition
Kunka, Bartosz
Kupryjanow, Adam
Dalka, Piotr
Bratoszewski, Piotr
Szczodrak, Maciej
Spaleniak, Pawel
Szykulski, Marcin
Czyzewski, Andrzej
[J]. 2013 SIGNAL PROCESSING: ALGORITHMS, ARCHITECTURES, ARRANGEMENTS, AND APPLICATIONS (SPA), 2013, : 106 - 111
[43] A corpus of audio-visual Lombard speech with frontal and profile views
Alghamdi, Najwa
Maddock, Steve
Marxer, Ricard
Barker, Jon
Brown, Guy J.
[J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2018, 143 (06) : EL523 - EL529
[44] TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech
Harte, Naomi
Gillen, Eoin
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2015, 17 (05) : 603 - 615
[45] Audio-visual feature fusion via deep neural networks for automatic speech recognition
Rahmani, Mohammad Hasan
Almasganj, Farshad
Seyyedsalehi, Seyyed Ali
[J]. DIGITAL SIGNAL PROCESSING, 2018, 82 : 54 - 63
[46] RETRACTED: Audio-Visual Automatic Speech Recognition Towards Education for Disabilities (Retracted Article)
Debnath, Saswati
Roy, Pinki
Namasudra, Suyel
Crespo, Ruben Gonzalez
[J]. JOURNAL OF AUTISM AND DEVELOPMENTAL DISORDERS, 2023, 53 (09) : 3581 - 3594
[47] Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition
Chen, Hang
Wang, Qing
Du, Jun
Yin, Bao-Cai
Pan, Jia
Lee, Chin-Hui
[J]. IEEE/ACM Transactions on Audio Speech and Language Processing, 2024, 32 : 2508 - 2521
[48] A Phone-Viseme Dynamic Bayesian Network for Audio-Visual Automatic Speech Recognition
Terry, Louis
Katsaggelos, Aggelos K.
[J]. 19TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1-6, 2008, : 2597 - 2600
[49] Speaker independent audio-visual continuous speech recognition
Liang, LH
Liu, XX
Zhao, YB
Pi, XB
Nefian, AV
[J]. IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I AND II, PROCEEDINGS, 2002, : A25 - A28
[50] Audio-visual fuzzy fusion for robust speech recognition
Malcangi, M.
Ouazzane, K.
Patel, P.
[J]. 2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2013,
