GlotLID: Language Identification for Low-Resource Languages

Cited: 0
Authors
Kargaran, Amir Hossein [1 ,2 ]
Imani, Ayyoob [1 ,2 ]
Yvon, Francois [3 ]
Schuetze, Hinrich [1 ,2 ]
Affiliations
[1] Ludwig Maximilians Univ Munchen, Ctr Informat & Language Proc, Munich, Germany
[2] Munich Ctr Machine Learning MCML, Munich, Germany
[3] Sorbonne Univ, CNRS, ISIR, Paris, France
Funding
European Research Council
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Several recent papers have published good solutions for language identification (LID) for about 300 high- and medium-resource languages. However, no available LID (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties, and generally noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. The GlotLID-M model, code, and list of data sources are available at https://github.com/cisnlp/GlotLID.
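The evaluation described above balances per-language F1 against false positive rate (FPR). As a minimal illustrative sketch (not code from the paper), treating each language as a binary detection task, both metrics can be computed from the confusion counts of that language:

```python
def f1_and_fpr(tp, fp, fn, tn):
    """Per-language F1 and FPR from binary confusion counts.

    tp: texts of this language correctly identified
    fp: texts of other languages wrongly labeled as this language
    fn: texts of this language labeled as something else
    tn: texts of other languages correctly not labeled as this language
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Fraction of out-of-language texts wrongly claimed -- the quantity that
    # matters most for low-resource languages, where even a tiny FPR lets
    # high-resource text leak into a small corpus.
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return f1, fpr


f1, fpr = f1_and_fpr(tp=90, fp=10, fn=10, tn=890)
```

With 90 true positives, 10 false positives, 10 false negatives and 890 true negatives, this yields F1 = 0.9 and FPR ≈ 0.011; a classifier can thus look strong on F1 while still contaminating a low-resource corpus, which is why the paper reports both.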
Pages: 6155-6218
Page count: 64