GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Cited by: 0
Authors
Gaim, Fitsum [1]
Yang, Wonsuk [1]
Park, Jong C. [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Keywords
language-identification; low-resource; multilingual models; Amharic; Blin; Ge'ez; Tigre; Tigrinya
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Language identification is a fundamental task in natural language processing and a prerequisite to data processing and numerous applications. Low-resourced languages with similar typologies are often confused with each other in real-world applications such as machine translation, which degrades the user's experience. In this work, we present a language-identification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge'ez script as a writing system: Amharic, Blin, Ge'ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, but we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer-based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at https://github.com/fgaim/geezswitch.
Pages: 6578-6584
Number of pages: 7