GlotLID: Language Identification for Low-Resource Languages

被引:0
|
作者
Kargaran, Amir Hossein [1 ,2 ]
Imani, Ayyoob [1 ,2 ]
Yvon, Francois [3 ]
Schuetze, Hinrich [1 ,2 ]
机构
[1] Ludwig Maximilians Univ Munchen, Ctr Informat & Language Proc, Munich, Germany
[2] Munich Ctr Machine Learning MCML, Munich, Germany
[3] Sorbonne Univ, CNRS, ISIR, Paris, France
基金
欧洲研究理事会;
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model, code, and list of data sources are available: https://github.com/cisnlp/GlotLID.
引用
收藏
页码:6155 / 6218
页数:64
相关论文
共 50 条
  • [1] Multilingual Offensive Language Identification for Low-resource Languages
    Ranasinghe, Tharindu
    Zampieri, Marcos
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (01)
  • [2] A Study on Low-resource Language Identification
    Qi, Zhaodi
    Ma, Yong
    Gu, Mingliang
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1897 - 1902
  • [3] Loanword Identification in Low-Resource Languages with Minimal Supervision
    Mi, Chenggang
    Xie, Lei
    Zhang, Yanning
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (03)
  • [4] Towards Language Service Creation and Customization for Low-Resource Languages
    Lin, Donghui
    Murakami, Yohei
    Ishida, Toru
    INFORMATION, 2020, 11 (02)
  • [5] Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification
    Joyanta Basu
    Soma Khan
    Rajib Roy
    Tapan Kumar Basu
    Swanirbhar Majumder
    Circuits, Systems, and Signal Processing, 2021, 40 : 4986 - 5013
  • [6] Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification
    Basu, Joyanta
    Khan, Soma
    Roy, Rajib
    Basu, Tapan Kumar
    Majumder, Swanirbhar
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2021, 40 (10) : 4986 - 5013
  • [7] Large Language Models and Low-Resource Languages: An Examination of Armenian NLP
    Avetisyan, Hayastan
    Broneske, David
    13TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING AND THE 3RD CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, IJCNLP-AACL 2023, 2023, : 199 - 210
  • [8] Fine Tuning Language Models: A Tale of Two Low-Resource Languages
    Rosel OidaOnesa
    Melvin ABallera
    Data Intelligence, 2024, 6 (04) : 946 - 967
  • [10] Enhancing African low-resource languages: Swahili data for language modelling
    Shikali, Casper S.
    Mokhosi, Refuoe
    DATA IN BRIEF, 2020, 31