GlotLID: Language Identification for Low-Resource Languages

Cited by: 0

Authors:

Kargaran, Amir Hossein [1,2]

Imani, Ayyoob [1,2]

Yvon, Francois [3]

Schuetze, Hinrich [1,2]

Affiliations:

[1] Ludwig Maximilians Univ Munchen, Ctr Informat & Language Proc, Munich, Germany

[2] Munich Ctr Machine Learning MCML, Munich, Germany

[3] Sorbonne Univ, CNRS, ISIR, Paris, France

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023 | 2023

Funding:

European Research Council

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties, and noisy data in general. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. The GlotLID-M model, code, and list of data sources are available at https://github.com/cisnlp/GlotLID.
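
The abstract compares systems by balancing F1 and false positive rate (FPR). As a minimal sketch of what these two metrics measure for a single language, the snippet below uses the standard one-vs-rest definitions. The function name `per_language_f1_fpr` and the toy labels are illustrative assumptions, not the authors' evaluation code, and the record does not state the paper's exact evaluation protocol.

```python
def per_language_f1_fpr(gold, pred, lang):
    """Return (F1, FPR) for `lang`, treating it as the positive class.

    Hypothetical helper for illustration only; standard one-vs-rest
    definitions, not taken from the GlotLID code base.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == lang and p == lang)
    fp = sum(1 for g, p in pairs if g != lang and p == lang)
    fn = sum(1 for g, p in pairs if g == lang and p != lang)
    tn = sum(1 for g, p in pairs if g != lang and p != lang)

    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # FPR is the share of other-language sentences wrongly labeled `lang`.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fpr


# Toy labels (made up for this example), using ISO 639-3 style codes.
gold = ["eng", "eng", "deu", "fra", "fra", "deu"]
pred = ["eng", "deu", "deu", "fra", "eng", "deu"]

for code in ("eng", "deu", "fra"):
    f1, fpr = per_language_f1_fpr(gold, pred, code)
    print(f"{code}: F1={f1:.2f} FPR={fpr:.2f}")
```

FPR matters for low-resource LID because a corpus is typically dominated by high-resource text: even a small FPR for a rare language can mean that most sentences assigned to it actually belong to other languages.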


Pages: 6155-6218

Page count: 64

