Training statistical language models from grammar-generated data: A comparative case-study

Authors
Hockey, Beth Ann [1]
Rayner, Manny [2]
Christian, Gwen [3]
Affiliations
[1] NASA, Ames Res Ctr, UCSC UARC, Mail Stop 19-26, Moffett Field, CA 94035 USA
[2] Univ Geneva, TIM ISSCO, CH-1211 Geneva, Switzerland
[3] Univ Calif Santa Cruz, Dept Linguist, Santa Cruz, CA 95064 USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Statistical language models (SLMs) for speech recognition have the advantage of robustness, and grammar-based models (GLMs) the advantage that they can be built even when little corpus data is available. A known way to attempt to combine these two methodologies is first to create a GLM, and then use that GLM to generate training data for an SLM. It has however been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the Open Source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly, and also to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to create the training data is critical, with PCFG generation heavily outscoring CFG generation.
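The generate-then-train idea in the abstract, and the difference between CFG and PCFG generation, can be illustrated with a minimal sketch. It is not the Regulus pipeline or the authors' setup: the toy grammar, its rule weights, and the helper names (`TOY_GRAMMAR`, `generate`, `bigram_counts`, `train_from_grammar`) are all hypothetical, and raw bigram counts stand in for a real SLM. With `use_probs=False` every rule of a nonterminal is equally likely (plain CFG generation); with `use_probs=True` rules are sampled according to their weights (PCFG generation).

```python
import random
from collections import Counter

# Hypothetical toy grammar: nonterminal -> list of (right-hand side, probability).
# Neither the rules nor the weights come from the paper.
TOY_GRAMMAR = {
    "S": [
        (("turn", "on", "DEVICE"), 0.7),
        (("turn", "off", "DEVICE"), 0.2),
        (("dim", "DEVICE"), 0.1),
    ],
    "DEVICE": [
        (("the", "light"), 0.8),
        (("the", "fan"), 0.2),
    ],
}


def generate(grammar, symbol, rng, use_probs):
    """Expand `symbol` top-down into a list of words."""
    if symbol not in grammar:
        return [symbol]  # terminal symbol: emit it as a word
    rules = grammar[symbol]
    options = [rhs for rhs, _ in rules]
    # weights=None makes random.choices pick uniformly (CFG-style generation).
    weights = [p for _, p in rules] if use_probs else None
    rhs = rng.choices(options, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(grammar, child, rng, use_probs))
    return words


def bigram_counts(sentences):
    """Count bigrams with sentence-boundary markers; a stand-in for SLM training."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


def train_from_grammar(grammar, n_sentences, use_probs, seed=0):
    """Generate a synthetic corpus from the grammar and count its bigrams."""
    rng = random.Random(seed)
    corpus = [generate(grammar, "S", rng, use_probs) for _ in range(n_sentences)]
    return bigram_counts(corpus)


if __name__ == "__main__":
    cfg_model = train_from_grammar(TOY_GRAMMAR, 2000, use_probs=False)
    pcfg_model = train_from_grammar(TOY_GRAMMAR, 2000, use_probs=True)
    for name, model in (("CFG", cfg_model), ("PCFG", pcfg_model)):
        print(name, model[("<s>", "turn")], model[("<s>", "dim")])
```

In this sketch the uniformly generated corpus over-represents rare constructions such as "dim ...", whereas the weighted corpus mirrors the rule probabilities. That is one plausible reading of why the choice of generation method matters, offered here as an illustration and not as the paper's own analysis.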
Pages: 193 / +
Page count: 3