Language modeling and transcription of the TED corpus lectures

Authors
Leeuwis, E [1]
Federico, M [1]
Cettolo, M [1]
Affiliation
[1] Univ Twente, Dept Comp Sci, NL-7500 AE Enschede, Netherlands
DOI
Not available
Chinese Library Classification number
O42 [Acoustics]
Subject classification code
070206; 082403
Abstract
Transcribing lectures is a challenging task for both acoustic and language modeling. In this work, we present our first results on the automatic transcription of lectures from the TED corpus, recently released by ELRA and LDC. In particular, we concentrated our effort on language modeling. Baseline acoustic and language models were developed using, respectively, 8 hours of TED transcripts and various types of text: conference proceedings, lecture transcripts, and conversational speech transcripts. Adaptation of the language model to single speakers was then investigated by exploiting different kinds of information: automatic transcripts of the talk, the title of the talk, the abstract and, finally, the paper. In the last case, a word error rate (WER) of 39.2% was achieved.
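For readers unfamiliar with the WER figure quoted in the abstract, the sketch below shows the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the scoring tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative only: splits on whitespace and applies no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far (initially empty) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.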
Pages: 232-235
Number of pages: 4