GlotLID: Language Identification for Low-Resource Languages

Times cited: 0
Authors
Kargaran, Amir Hossein [1 ,2 ]
Imani, Ayyoob [1 ,2 ]
Yvon, Francois [3 ]
Schuetze, Hinrich [1 ,2 ]
Affiliations
[1] Ludwig Maximilians Univ Munchen, Ctr Informat & Language Proc, Munich, Germany
[2] Munich Ctr Machine Learning MCML, Munich, Germany
[3] Sorbonne Univ, CNRS, ISIR, Paris, France
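Funding
European Research Council;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract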
Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties, and, in general, noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. The GlotLID-M model, code, and list of data sources are available at https://github.com/cisnlp/GlotLID.
Pages: 6155-6218
Page count: 64