Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks

Cited by: 0
Authors
Nag, Arijit [1 ]
Samanta, Bidisha [1 ]
Mukherjee, Animesh [1 ]
Ganguly, Niloy [1 ]
Chakrabarti, Soumen [2 ]
Institutions
[1] Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
[2] Indian Institute of Technology Bombay, Mumbai, Maharashtra, India
DOI: not available
Abstract
Multilingual language models (MLLMs) like mBERT promise to extend the benefits of NLP research to low-resource languages (LRLs). However, LRL words are under-represented in the wordpiece/subword vocabularies of MLLMs. As a result, many LRL words are either replaced by UNK or assembled from morphologically unrelated wordpieces, which lowers task accuracy. Pre-training MLLMs anew after including LRL documents is resource-intensive in terms of both human input and computation. In response, we propose EVALM (entropy-based vocabulary augmented language model), which uses a new task-cognizant measure to detect the most vulnerable LRL words, i.e., those with undesirable wordpiece segmentations. EVALM then provides reasonable initializations of their embeddings, followed by limited fine-tuning on the small LRL task corpus. Across various classification tasks on multiple diverse LRLs, as well as code-mixed text, our experiments show significant performance improvements but also some surprising limits to such vocabulary augmentation strategies. We will release the code and data to enable further research.
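Since the abstract describes the pipeline only at a high level, the following is a minimal, hypothetical sketch of entropy-guided vocabulary augmentation using Hugging Face Transformers. The vulnerability_score function (label entropy weighted by wordpiece fragmentation) and the mean-of-subword embedding initialization are illustrative assumptions standing in for the paper's task-cognizant EVALM measure and initialization scheme, which are not specified here; the toy corpus and the choice of mBERT as the base model are likewise placeholders.

```python
# Minimal sketch: add the most "vulnerable" LRL words to an MLLM's vocabulary
# and initialize their embeddings before limited task fine-tuning.
# The scoring and initialization choices below are illustrative assumptions,
# not the paper's actual EVALM method.
import math
from collections import Counter, defaultdict

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed base MLLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy LRL task corpus: (sentence, label) pairs; replace with the real task data.
corpus = [
    ("example lrl sentence one", 0),
    ("another lrl sentence", 1),
]

def vulnerability_score(word, label_counts):
    """Illustrative score: words that fragment into many wordpieces (or map to UNK)
    and whose label distribution is skewed (low entropy, i.e. task-discriminative)
    score high."""
    pieces = tokenizer.tokenize(word)
    fragmentation = len(pieces) if tokenizer.unk_token not in pieces else 10
    total = sum(label_counts.values())
    entropy = -sum((c / total) * math.log(c / total + 1e-12)
                   for c in label_counts.values())
    return fragmentation * (1.0 / (1.0 + entropy))

# Collect per-word label statistics from the task corpus.
label_counts = defaultdict(Counter)
for text, label in corpus:
    for word in set(text.split()):
        label_counts[word][label] += 1

# Rank words and keep the most vulnerable multi-piece words for vocabulary augmentation.
ranked = sorted(label_counts,
                key=lambda w: vulnerability_score(w, label_counts[w]),
                reverse=True)
new_words = [w for w in ranked[:50] if len(tokenizer.tokenize(w)) > 1]

# Remember each word's old segmentation before the vocabulary changes.
old_segmentations = {w: tokenizer.tokenize(w) for w in new_words}

tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new embedding as the mean of its former wordpiece embeddings
# (one simple "reasonable initialization"; the paper's exact scheme may differ).
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for w in new_words:
        old_ids = tokenizer.convert_tokens_to_ids(old_segmentations[w])
        new_id = tokenizer.convert_tokens_to_ids(w)
        emb[new_id] = emb[old_ids].mean(dim=0)

# The model and tokenizer are now ready for limited fine-tuning on the LRL task corpus.
```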
Pages: 8619-8629 (11 pages)
Related Papers (showing 10 of 50)
  • [1] Efficient Adaptation: Enhancing Multilingual Models for Low-Resource Language Translation
    Sel, Ilhami
    Hanbay, Davut
    MATHEMATICS, 2024, 12 (19)
  • [2] Multilingual Offensive Language Identification for Low-resource Languages
    Ranasinghe, Tharindu
    Zampieri, Marcos
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (01)
  • [3] Getting More Data for Low-resource Morphological Inflection: Language Models and Data Augmentation
    Sorokin, Alexey
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3978 - 3983
  • [4] Multilingual acoustic models for speech recognition in low-resource devices
    Garcia, Enrique Gil
    Mengusoglu, Erhan
    Janke, Eric
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 981+
  • [5] Lexicon-based fine-tuning of multilingual language models for low-resource language sentiment analysis
    Dhananjaya, Vinura
    Ranathunga, Surangika
    Jayasena, Sanath
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2024, 9 (05) : 1116 - 1125
  • [6] DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks
    Ding, Bosheng
    Liu, Linlin
    Bing, Lidong
    Kruengkrai, Canasai
    Nguyen, Thien Hai
    Joty, Shafiq
    Si, Luo
    Miao, Chunyan
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 6045 - 6057
  • [7] Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model
    Nguyen, Xuan-Phi
    Joty, Shafiq
    Kui, Wu
    Aw, Ai Ti
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [8] Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model
    Nguyen, Xuan-Phi
    Joty, Shafiq
    Kui, Wu
    Aw, Ai Ti
    arXiv, 2022,
  • [9] Entropy-Guided Distributional Reinforcement Learning with Controlling Uncertainty in Robotic Tasks
    Cho, Hyunjin
    Kim, Hyunseok
    APPLIED SCIENCES-BASEL, 2025, 15 (05):
  • [10] adaptMLLM: Fine-Tuning Multilingual Language Models on Low-Resource Languages with Integrated LLM Playgrounds
    Lankford, Seamus
    Afli, Haithem
    Way, Andy
    INFORMATION, 2023, 14 (12)