Overcoming the sparseness problem of spoken language corpora using other large corpora of distinct characteristics

Authors
Cho, SY
Kim, SH
Park, J
Lee, YJ
Affiliations
[1] MyongJi Univ, Dept Comp Sci, KyungGi, South Korea
[2] Elect & Telecommun Res Inst, Taejon, South Korea
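Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;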
Abstract
This paper proposes a method of combining two n-gram language models, one built from a very small corpus in the domain of interest and the other built from a large but less well-matched corpus, to produce a significantly enhanced language model. The method rests on the observation that a small in-domain corpus yields high-quality n-grams but suffers from a serious sparseness problem, while a large corpus from a different domain provides more n-gram statistics but is biased away from the target domain. The two n-gram models are combined by extending the idea of Katz's backoff. We ran experiments with 3-gram language models constructed from newspaper corpora of several million to tens of millions of words, together with models from smaller broadcast news corpora. The target domain was broadcast news. We obtained a significant improvement (30%) by incorporating a small corpus about one thirtieth the size of the newspaper corpus.
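The abstract does not give the combination formula, so the following is only a minimal sketch of one way a backoff from an in-domain model to an out-of-domain model could look, not the authors' method. It makes several assumptions for brevity: it works on bigrams instead of the 3-grams used in the paper, it uses absolute discounting in place of the Good-Turing discounting of standard Katz backoff, and it smooths the large-corpus model with add-one smoothing. The class name `DomainBackoffModel` and the `discount` parameter are hypothetical.

```python
from collections import Counter


class DomainBackoffModel:
    """Bigram model that trusts a small in-domain corpus for seen bigrams
    and backs off to a large out-of-domain corpus for unseen ones.

    Illustrative assumptions: absolute discounting (not Good-Turing) and
    an add-one smoothed large-corpus model.
    """

    def __init__(self, small_tokens, large_tokens, discount=0.5):
        self.d = discount
        # In-domain statistics (high quality, but sparse).
        self.small_bi = Counter(zip(small_tokens, small_tokens[1:]))
        self.small_ctx = Counter(small_tokens[:-1])
        # Out-of-domain statistics (plentiful, but biased).
        self.large_bi = Counter(zip(large_tokens, large_tokens[1:]))
        self.large_ctx = Counter(large_tokens[:-1])
        self.vocab = sorted(set(small_tokens) | set(large_tokens))

    def p_large(self, w, h):
        """Add-one smoothed bigram probability from the large corpus."""
        return (self.large_bi[(h, w)] + 1) / (self.large_ctx[h] + len(self.vocab))

    def prob(self, w, h):
        """P(w | h) for w in the joint vocabulary."""
        c_h = self.small_ctx[h]
        if c_h == 0:
            # Context never seen in-domain: rely on the large corpus alone.
            return self.p_large(w, h)
        c_hw = self.small_bi[(h, w)]
        if c_hw > 0:
            # Seen in-domain: discounted relative frequency.
            return (c_hw - self.d) / c_h
        # Unseen in-domain: share the discounted mass among unseen words
        # in proportion to the large-corpus model.
        seen = [v for v in self.vocab if self.small_bi[(h, v)] > 0]
        reserved = self.d * len(seen) / c_h
        unseen_mass = 1.0 - sum(self.p_large(v, h) for v in seen)
        return reserved * self.p_large(w, h) / unseen_mass


# Toy data standing in for broadcast news (small) and newspaper text (large).
small = "the news at nine the news today".split()
large = "the paper said the market fell the news broke at noon".split()
model = DomainBackoffModel(small, large)
print(model.prob("news", "the"))    # bigram seen in the small corpus
print(model.prob("market", "the"))  # bigram seen only in the large corpus
```

In this sketch the probability mass removed from seen in-domain bigrams is redistributed according to the large-corpus model, so the distribution over the joint vocabulary still sums to one for every context. Words outside the joint vocabulary are not handled.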
Pages: 407-411
Number of pages: 5