Experimenting with lipreading for large vocabulary continuous speech recognition

Cited by: 4

Authors:

Palecek, Karel [1]

Affiliations:

[1] Tech Univ Liberec, Inst Informat Technol & Elect, Liberec 46117, Czech Republic

Source:

JOURNAL ON MULTIMODAL USER INTERFACES | 2018 / Vol. 12 / Issue 04

Keywords:

Audiovisual speech recognition; Lipreading; LVCSR; AUDIOVISUAL SPEECH;

DOI:

10.1007/s12193-018-0266-2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The vast majority of current research on audiovisual speech recognition via lipreading from frontal face videos focuses on simple cases such as isolated phrase recognition or structured speech, where the vocabulary is limited to several tens of units. In this paper, we diverge from these traditional applications and investigate the effect of incorporating visual and depth information in continuous speech recognition with vocabulary sizes ranging from several hundred to half a million words. To this end, we evaluate various visual speech parametrizations, both existing and novel, that are designed to capture different kinds of information in the video and depth signals. The experiments are conducted on a moderately sized dataset of 54 speakers, each uttering 100 sentences in Czech. Both the video and depth data were captured by the Microsoft Kinect device. We show that even for large vocabularies the visual signal contains enough information to improve the word accuracy by up to 22% relative to acoustic-only recognition. Somewhat surprisingly, a relative improvement of up to 16% has also been reached using the interpolated depth data.

Pages: 309 - 318

Number of pages: 10

