GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Cited by: 0
Authors
Gaim, Fitsum [1]
Yang, Wonsuk [1]
Park, Jong C. [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Keywords
language-identification; low-resource; multilingual models; Amharic; Blin; Ge'ez; Tigre; Tigrinya
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Language identification is a fundamental task in natural language processing and a prerequisite for data processing and numerous applications. Low-resourced languages with similar typologies are often confused with one another in real-world applications such as machine translation, which degrades the user experience. In this work, we present a language-identification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge'ez script as a writing system, namely Amharic, Blin, Ge'ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, and we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer-based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at https://github.com/fgaim/geezswitch.
Pages: 6578-6584
Page count: 7
Related papers
50 in total
  • [31] Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yoruba and Twi
    Alabi, Jesujoba O.
    Amponsah-Kaakyire, Kwabena
    Adelani, David I.
    Espana-Bonet, Cristina
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2754 - 2762
  • [32] Using Annotation Projection for Semantic Role Labeling of Low-Resourced Language: Sinhala
    Gunasekara, Sandun
    Chathura, Dulanjaya
    Jeewantha, Chamoda
    Dias, Gihan
    2020 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2020), 2020, : 98 - 103
  • [33] Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for Tigrinya
    Gaim, Fitsum
    Yang, Wonsuk
    Park, Hancheol
    Park, Jong C.
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 11857 - 11870
  • [34] Language Model Data Augmentation for Keyword Spotting in Low-Resourced Training Conditions
    Gorin, Arseniy
    Lileikyte, Rasa
    Huang, Guangpu
    Lamel, Lori
    Gauvain, Jean-Luc
    Laurent, Antoine
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 775 - 779
  • [35] Toward the Development of Large-Scale Word Embedding for Low-Resourced Language
    Nazir, Shahzad
    Asif, Muhammad
    Sahi, Shahbaz Ahmad
    Ahmad, Shahbaz
    Ghadi, Yazeed Yasin
    Aziz, Muhammad Haris
    IEEE ACCESS, 2022, 10 : 54091 - 54097
  • [36] Case Study on Data Collection of Kreol Morisien, a Low-Resourced Creole Language
    Bastien, David Joshen
    Chumroo, Vijay Prakash
    Bastien, Johan Patrice
    2022 IST-AFRICA CONFERENCE, 2022,
  • [37] Leveraging Large Language Models in Low-resourced Language NLP: A spaCy Implementation for Modern Tibetan
    Kyogoku, Yuki
    Erhard, Franz Xaver
    Engels, James
    Barnett, Robert
    REVUE D'ETUDES TIBETAINES, 2025, (74):
  • [38] Language Identification for Under-Resourced Languages in the Basque Context
    Barroso, Nora
    de Ipina, Karmele Lopez
    Grana, Manuel
    Ezeiza, Aitzol
    SOFT COMPUTING MODELS IN INDUSTRIAL AND ENVIRONMENTAL APPLICATIONS, 6TH INTERNATIONAL CONFERENCE SOCO 2011, 2011, 87 : 475 - 483
  • [39] Evaluation of Neural Network Transformer Models for Named-Entity Recognition on Low-Resourced Languages
    Hanslo, Ridewaan
    PROCEEDINGS OF THE 2021 16TH CONFERENCE ON COMPUTER SCIENCE AND INTELLIGENCE SYSTEMS (FEDCSIS), 2021, : 115 - 119
  • [40] INTENT RECOGNITION AND UNSUPERVISED SLOT IDENTIFICATION FOR LOW-RESOURCED SPOKEN DIALOG SYSTEMS
    Gupta, Akshat
    Deng, Olivia
    Kushwaha, Akruti
    Mittal, Saloni
    Zeng, William
    Rallabandi, Sai Krishna
    Black, Alan W.
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 853 - 860