GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Authors
Gaim, Fitsum [1]
Yang, Wonsuk [1]
Park, Jong C. [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Keywords
language-identification; low-resource; multilingual models; Amharic; Blin; Ge'ez; Tigre; Tigrinya
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Language identification is one of the fundamental tasks in natural language processing that is a prerequisite to data processing and numerous applications. Low-resourced languages with similar typologies are generally confused with each other in real-world applications such as machine translation, affecting the user's experience. In this work, we present a language-identification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge'ez script as a writing system, namely Amharic, Blin, Ge'ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, but we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer-based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at https://github.com/fgaim/geezswitch.
Pages: 6578-6584
Page count: 7