GlotLID: Language Identification for Low-Resource Languages

Cited by: 0

Authors:

Kargaran, Amir Hossein [1,2]

Imani, Ayyoob [1,2]

Yvon, Francois [3]

Schuetze, Hinrich [1,2]

Affiliations:

[1] Ludwig Maximilians Univ Munchen, Ctr Informat & Language Proc, Munich, Germany

[2] Munich Ctr Machine Learning MCML, Munich, Germany

[3] Sorbonne Univ, CNRS, ISIR, Paris, France

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023 | 2023

Funding:

European Research Council

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties, and noisy data in general. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. The GlotLID-M model, code, and list of data sources are available at https://github.com/cisnlp/GlotLID.
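The abstract compares systems by balancing F1 against false positive rate (FPR). As a rough illustration of those two quantities, the sketch below computes per-language F1 and FPR from gold and predicted labels using the standard textbook definitions. The function name `per_language_f1_fpr` and the toy labels are hypothetical, and this record does not say how the paper aggregates or weights the scores, so this should not be read as the authors' evaluation code.

```python
from collections import Counter


def per_language_f1_fpr(gold, pred):
    """Return {label: (f1, fpr)} for every label seen in gold or pred.

    Hypothetical helper using the standard definitions:
      F1  = 2*TP / (2*TP + FP + FN)
      FPR = FP / (number of examples whose gold label is another language)
    """
    n = len(gold)
    tp = Counter(g for g, p in zip(gold, pred) if g == p)
    gold_count = Counter(gold)
    pred_count = Counter(pred)

    scores = {}
    for label in set(gold) | set(pred):
        fp = pred_count[label] - tp[label]   # predicted as label, but wrong
        fn = gold_count[label] - tp[label]   # gold is label, but missed
        negatives = n - gold_count[label]    # examples that are not this label
        denom = 2 * tp[label] + fp + fn
        f1 = 2 * tp[label] / denom if denom else 0.0
        fpr = fp / negatives if negatives else 0.0
        scores[label] = (f1, fpr)
    return scores


# Toy data (assumed for illustration): a high-resource label that
# absorbs sentences from other languages.
gold = ["deu", "deu", "fra", "fra", "bar", "bar"]
pred = ["deu", "deu", "fra", "deu", "deu", "bar"]
scores = per_language_f1_fpr(gold, pred)
for label in sorted(scores):
    f1, fpr = scores[label]
    print(f"{label}: F1={f1:.2f} FPR={fpr:.2f}")
```

In this toy data all three labels reach the same F1, yet only `deu` has a nonzero FPR, because it is over-predicted on text from the other two languages. That is the kind of leakage from high-resource languages the abstract mentions, and it shows why FPR can be worth reporting alongside F1.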

Citation

Pages: 6155-6218

Page count: 64

