How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages

Citations: 0
Authors
Bansal, Rachit [1 ]
Choudhary, Himanshu [1 ]
Punia, Ravneet [1 ]
Schenk, Niko [2 ]
Dahl, Jacob L. [3 ]
Page-Perron, Emilie [3 ]
Affiliations
[1] Delhi Technol Univ, Delhi, India
[2] Amazon Berlin, Berlin, Germany
[3] Univ Oxford, Oxford, England
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Despite recent advances in attention-based deep learning architectures across most Natural Language Processing tasks, their application remains limited in low-resource settings because pre-trained models are lacking for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques to an extremely low-resource language, Sumerian cuneiform, one of the world's oldest written languages, attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We also introduce InterpretLR, an interpretability toolkit for low-resource NLP, and use it alongside human evaluations to gauge the trained models. Notably, all our techniques and most components of our pipeline can be generalised to any low-resource language. We publicly release all our implementations, including a novel dataset with domain-specific pre-processing, to promote further research in this domain.
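The abstract describes a staged information-extraction pipeline in which each component (POS tagging, then NER, then machine translation) enriches the same token stream. A minimal sketch of that staging pattern is below; all names, the toy taggers, and the tiny glossary are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a staged extraction pipeline (POS -> NER -> MT),
# mirroring the structure the abstract describes. The stage logic here is
# deliberately toy-sized; the paper's actual models are neural.
from typing import Callable, Dict, List

Token = Dict[str, str]  # one token with its accumulated annotations
Stage = Callable[[List[Token]], List[Token]]

def pos_stage(tokens: List[Token]) -> List[Token]:
    """Placeholder POS tagger: tags every token as a noun."""
    for tok in tokens:
        tok["pos"] = "NOUN"
    return tokens

def ner_stage(tokens: List[Token]) -> List[Token]:
    """Placeholder NER: treats capitalised transliterations as personal names."""
    for tok in tokens:
        tok["ner"] = "PN" if tok["form"][:1].isupper() else "O"
    return tokens

def mt_stage(tokens: List[Token]) -> List[Token]:
    """Placeholder translation: looks each form up in a tiny glossary."""
    glossary = {"lugal": "king", "e2": "house"}
    for tok in tokens:
        tok["en"] = glossary.get(tok["form"], tok["form"])
    return tokens

def run_pipeline(forms: List[str], stages: List[Stage]) -> List[Token]:
    """Thread the token list through each stage in order."""
    tokens: List[Token] = [{"form": f} for f in forms]
    for stage in stages:
        tokens = stage(tokens)
    return tokens

result = run_pipeline(["lugal", "Ur-Namma", "e2"],
                      [pos_stage, ner_stage, mt_stage])
print(result[0])  # {'form': 'lugal', 'pos': 'NOUN', 'ner': 'O', 'en': 'king'}
```

Because each stage only adds keys to the shared token dictionaries, stages can be swapped or extended (e.g. inserting a normalisation step) without changing the driver, which is the property that lets most pipeline components generalise across low-resource languages.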
Pages: 44-59
Number of pages: 16