IRSTLM: an Open Source Toolkit for Handling Large Scale Language Models

Authors
Federico, Marcello [1]
Bertoldi, Nicola [1]
Cettolo, Mauro [1]
Affiliations
[1] FBK Irst Ric Sci & Tecnol, Povo, TN, Italy
Keywords
Automatic Speech Recognition; Language Modeling; Statistical Machine Translation;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Research in speech recognition and machine translation is boosting the use of large-scale n-gram language models. We present an open source toolkit that makes it possible to efficiently handle language models with billions of n-grams on conventional machines. The IRSTLM toolkit supports distribution of n-gram collection and smoothing over a computer cluster, language model compression through probability quantization, and lazy loading of huge language models from disk. IRSTLM has so far been successfully deployed with the Moses toolkit for statistical machine translation and with the FBK-irst speech recognition system. The efficiency of the tool is reported on a speech transcription task of Italian political speeches using a language model of 1.1 billion four-grams.
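To illustrate the idea behind "compression through probability quantization" mentioned in the abstract, the following is a minimal sketch, not the toolkit's actual implementation. It assumes an equal-population binning scheme in which each bin is represented by its mean; the function names `build_codebook`, `quantize`, and `dequantize` are hypothetical and do not come from IRSTLM.

```python
def build_codebook(values, levels=256):
    """Split the sorted values into equally populated bins.

    Each bin is represented by the mean of its members (an assumed scheme).
    With levels <= 256, every code fits in a single byte.
    """
    ordered = sorted(values)
    levels = min(levels, len(ordered))  # never create empty bins
    codebook = []
    for i in range(levels):
        lo = i * len(ordered) // levels
        hi = (i + 1) * len(ordered) // levels
        chunk = ordered[lo:hi]
        codebook.append(sum(chunk) / len(chunk))
    return codebook


def quantize(value, codebook):
    """Return the index of the code point nearest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def dequantize(code, codebook):
    """Map a stored code back to its representative value."""
    return codebook[code]


# Toy log-probabilities standing in for n-gram scores (made-up values).
log_probs = [-1.0, -2.0, -3.0, -4.0]
codebook = build_codebook(log_probs, levels=2)
code = quantize(-3.9, codebook)
approx = dequantize(code, codebook)
```

In this sketch each probability is stored as a small integer code plus one shared codebook, which trades a bounded approximation error for a smaller memory footprint per n-gram.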
Pages: 1618-1621
Number of pages: 4