GlotLID: Language Identification for Low-Resource Languages

Authors
Kargaran, Amir Hossein [1, 2]
Imani, Ayyoob [1, 2]
Yvon, Francois [3]
Schuetze, Hinrich [1, 2]
Affiliations
[1] Ludwig Maximilians Univ Munchen, Ctr Informat & Language Proc, Munich, Germany
[2] Munich Ctr Machine Learning MCML, Munich, Germany
[3] Sorbonne Univ, CNRS, ISIR, Paris, France
Funding
European Research Council;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties, and noisy data in general. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. The GlotLID-M model, code, and list of data sources are available at https://github.com/cisnlp/GlotLID.
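The abstract compares systems by balancing F1 against false positive rate (FPR) but does not spell out how the two are computed. As a rough illustration only, the sketch below uses the standard one-vs-rest definitions for each language label; the function name `per_language_f1_fpr` and the toy labels are assumptions made for this example and are not taken from the GlotLID code or its evaluation setup.

```python
from collections import Counter


def per_language_f1_fpr(gold, pred):
    """Return {label: (f1, fpr)} using one-vs-rest counts per label.

    Hypothetical helper for illustration; not part of GlotLID.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted for a sentence of another language
            fn[g] += 1  # g was missed
    total = len(gold)
    scores = {}
    for lang in set(gold) | set(pred):
        # Negatives for `lang`: sentences whose gold label is not `lang`.
        negatives = total - (tp[lang] + fn[lang])
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        f1 = 2 * tp[lang] / denom if denom else 0.0
        fpr = fp[lang] / negatives if negatives else 0.0
        scores[lang] = (f1, fpr)
    return scores


# Toy data (assumed): a high-resource label absorbing related-language text.
gold = ["deu", "deu", "fra", "fra", "bar", "bar"]
pred = ["deu", "deu", "fra", "deu", "deu", "bar"]
scores = per_language_f1_fpr(gold, pred)
for lang in sorted(scores):
    f1, fpr = scores[lang]
    print(f"{lang}: F1={f1:.3f} FPR={fpr:.3f}")
```

In this toy data the label `deu` is predicted for two sentences of other languages, so its FPR rises while its recall stays perfect. That is the kind of leakage from high-resource languages that F1 alone would understate.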
Pages: 6155 - 6218
Page count: 64
Related papers
50 in total
  • [41] An end-to-end framework for translation of American sign language to low-resource languages in Nigeria
    Dere, Mustapha Deji
    Dere, Roshidat Oluwabukola
    Adesina, Adewale
    Yauri, Aliyu Rufai
    SCIENTIFIC AFRICAN, 2023, 21
  • [42] A Survey on Challenges and Advances in Natural Language Processing with a Focus on Legal Informatics and Low-Resource Languages
    Krasadakis, Panteleimon
    Sakkopoulos, Evangelos
    Verykios, Vassilios S.
    ELECTRONICS, 2024, 13 (03)
  • [43] Fine-tuning large language models for improved health communication in low-resource languages
    Bui, Nhat
    Nguyen, Giang
    Nguyen, Nguyen
    Vo, Bao
    Vo, Luan
    Huynh, Tom
    Tang, Arthur
    Tran, Van Nhiem
    Huynh, Tuyen
    Nguyen, Huy Quang
    Dinh, Minh
    COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2025, 263
  • [44] A neural approach for inducing multilingual resources and natural language processing tools for low-resource languages
    Zennaki, O.
    Semmar, N.
    Besacier, L.
    NATURAL LANGUAGE ENGINEERING, 2019, 25 (01) : 43 - 67
  • [45] Efficient Entity Candidate Generation for Low-Resource Languages
    Garcia-Duran, Alberto
    Arora, Akhil
    West, Robert
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6429 - 6438
  • [46] Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification
    Dey, Spandan
    Sahidullah, Md
    Saha, Goutam
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 5040 - 5050
  • [47] Machine Translation into Low-resource Language Varieties
    Kumar, Sachin
    Anastasopoulos, Antonios
    Wintner, Shuly
    Tsvetkov, Yulia
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 110 - 121
  • [48] Character Profiling in Low-Resource Language Documents
    Wong, Tak-sum
    Lee, John
    ADCS 2019: PROCEEDINGS OF THE 24TH AUSTRALASIAN DOCUMENT COMPUTING SYMPOSIUM, 2019,
  • [49] Using Explainable AI (XAI) for Identification of Subjectivity in Hate Speech Annotations for Low-Resource Languages
    Sawant, Madhuri
    Qureshi, M. Atif
    Younus, Arjumand
    Caton, Simon
    PROCEEDINGS OF THE 2024 WORKSHOP ON OPEN CHALLENGES IN ONLINE SOCIAL NETWORKS, OASIS 2024, 2024, : 10 - 17
  • [50] How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages
    Bansal, Rachit
    Choudhary, Himanshu
    Punia, Ravneet
    Schenk, Niko
    Dahl, Jacob L.
    Page-Perron, Emilie
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING: PROCEEDINGS OF THE STUDENT RESEARCH WORKSHOP, 2021, : 44 - 59