GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

Cited by: 0
Authors
Gaim, Fitsum [1]
Yang, Wonsuk [1]
Park, Jong C. [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
Keywords
language identification; low-resource; multilingual models; Amharic; Blin; Ge'ez; Tigre; Tigrinya
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Language identification is one of the fundamental tasks in natural language processing that is a prerequisite to data processing and numerous applications. Low-resourced languages with similar typologies are generally confused with each other in real-world applications such as machine translation, affecting the user's experience. In this work, we present a language-identification dataset for five typologically and phylogenetically related low-resourced East African languages that use the Ge'ez script as a writing system, namely Amharic, Blin, Ge'ez, Tigre, and Tigrinya. The dataset is built automatically from selected data sources, but we also performed a manual evaluation to assess its quality. Our approach to constructing the dataset is cost-effective and applicable to other low-resource languages. We integrated the dataset into an existing language-identification tool and also fine-tuned several Transformer-based language models, achieving very strong results in all cases. While the task of language identification is easy for the informed person, such datasets can make a difference in real-world deployments and also serve as part of a benchmark for language understanding in the target languages. The data and models are made available at https://github.com/fgaim/geezswitch.
Pages: 6578-6584
Page count: 7