Language modeling and transcription of the TED corpus lectures

Cited by: 0

Authors:

Leeuwis, E [1]

Federico, M [1]

Cettolo, M [1]

Affiliation:

[1] Univ Twente, Dept Comp Sci, NL-7500 AE Enschede, Netherlands

Source:

2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING I | 2003

Keywords:

DOI:

Not available

Chinese Library Classification:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Transcribing lectures is a challenging task, both in acoustic and in language modeling. In this work, we present our first results on the automatic transcription of lectures from the TED corpus, recently released by ELRA and LDC. In particular, we concentrated our effort on language modeling. Baseline acoustic and language models were developed using respectively 8 hours of TED transcripts and various types of texts: conference proceedings, lecture transcripts, and conversational speech transcripts. Then, adaptation of the language model to single speakers was investigated by exploiting different kinds of information: automatic transcripts of the talk, the title of the talk, the abstract and, finally, the paper. In the last case, a 39.2% WER was achieved.


Pages: 232 - 235

Page count: 4

50 items in total

[21] TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style
Deniz Zeyrek
Amália Mendes
Yulia Grishina
Murathan Kurfalı
Samuel Gibbon
Maciej Ogrodniczuk
Language Resources and Evaluation, 2020, 54 : 587 - 613
[22] WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse
Faruqui, Manaal
Pavlick, Ellie
Tenney, Ian
Das, Dipanjan
2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 305 - 315
[23] A Corpus for Modeling User and Language Effects in Argumentation on Online Debating
Durmus, Esin
Cardie, Claire
57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 602 - 607
[24] TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style
Zeyrek, Deniz
Mendes, Amalia
Grishina, Yulia
Kurfali, Murathan
Gibbon, Samuel
Ogrodniczuk, Maciej
LANGUAGE RESOURCES AND EVALUATION, 2020, 54 (02) : 587 - 613
[25] The Guarani Language, Lectures
Garvin, Paul L.
INTERNATIONAL JOURNAL OF AMERICAN LINGUISTICS, 1953, 19 (02) : 156 - 159
[26] Advances in the automatic transcription of lectures
Cettolo, M
Brugnara, F
Federico, M
2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 769 - 772
[27] Language Modeling for Automatic Turkish Broadcast News Transcription
Arisoy, Ebru
Sak, Hasim
Saraclar, Murat
INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 2748 - 2751
[28] Incremental language modeling for automatic transcription of broadcast news
Ohtsuki, Katsutoshi
Nguyen, Long
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2007, E90D (02): : 526 - 532
[29] Automatic processing of audio lectures for information retrieval: Vocabulary selection and language modeling
Park, A
Hazen, TJ
Glass, JR
2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 497 - 500
[30] Design and implementation of an online corpus of presentation transcripts of TED Talks
Hasebe, Yoichiro
CURRENT WORK IN CORPUS LINGUISTICS: WORKING WITH TRADITIONALLY- CONCEIVED CORPORA AND BEYOND (CILC2015), 2015, 198 : 174 - 182
