Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks

Cited by: 0
Authors
Nag, Arijit [1]
Samanta, Bidisha [1]
Mukherjee, Animesh [1]
Ganguly, Niloy [1]
Chakrabarti, Soumen [2]
Affiliations
[1] IIT Kharagpur, Kharagpur, West Bengal, India
[2] Indian Institute of Technology, Bombay, Maharashtra, India
DOI: not available
Abstract
Multilingual language models (MLLMs) like mBERT promise to extend the benefits of NLP research to low-resource languages (LRLs). However, LRL words are under-represented in the wordpiece/subword vocabularies of MLLMs. As a result, many LRL words are replaced by UNK or assembled from morphologically unrelated wordpieces, which lowers task accuracy. (Pre-)training MLLMs after including LRL documents is resource-intensive in terms of both human input and computation. In response, we propose EVALM (entropy-based vocabulary augmented language model), which uses a new task-cognizant measurement to detect the most vulnerable LRL words, whose wordpiece segmentations are undesirable. EVALM then provides reasonable initializations of their embeddings, followed by limited fine-tuning using the small LRL task corpus. Our experiments show significant performance improvements and also some surprising limits to such vocabulary augmentation strategies in various classification tasks for multiple diverse LRLs, as well as code-mixed texts. We will release the code and data to enable further research.
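The abstract does not state the measurement or the initialization formula, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes that a word is "vulnerable" when its own class-label distribution has low entropy while the wordpieces it is split into have higher entropy, and that a new embedding can start as the mean of its wordpiece vectors. The names `toy_segment`, `vulnerability_scores`, and `init_embedding` are hypothetical, and the greedy segmenter merely stands in for a real WordPiece tokenizer.

```python
import math
from collections import Counter, defaultdict


def label_entropy(counts):
    """Shannon entropy (in bits) of a label-count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def toy_segment(word, vocab):
    """Hypothetical greedy longest-match segmenter (stand-in for WordPiece)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches at position i: the whole word is unknown.
            return ["[UNK]"]
    return pieces


def vulnerability_scores(corpus, vocab):
    """Assumed score: mean wordpiece label entropy minus whole-word label entropy.

    corpus is an iterable of (text, label) pairs from the task data.
    A larger score means the segmentation discards more class signal.
    """
    word_labels = defaultdict(Counter)
    piece_labels = defaultdict(Counter)
    for text, label in corpus:
        for word in text.split():
            word_labels[word][label] += 1
            for piece in toy_segment(word, vocab):
                piece_labels[piece][label] += 1
    scores = {}
    for word, counts in word_labels.items():
        pieces = toy_segment(word, vocab)
        mean_piece_h = sum(label_entropy(piece_labels[p]) for p in pieces) / len(pieces)
        scores[word] = mean_piece_h - label_entropy(counts)
    return scores


def init_embedding(pieces, piece_vectors):
    """Assumed initialization: element-wise mean of the wordpiece vectors."""
    dim = len(next(iter(piece_vectors.values())))
    return [sum(piece_vectors[p][k] for p in pieces) / len(pieces) for k in range(dim)]
```

Under these assumptions, the highest-scoring words would be the candidates to add to the vocabulary before fine-tuning on the task corpus.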
Pages: 8619-8629
Page count: 11
Related papers
50 in total
  • [21] Exploring Large Language Models for Low-Resource IT Information Extraction
    Bhavya, Bhavya
    Isaza, Paulina Toro
    Deng, Yu
    Nidd, Michael
    Azad, Amar Prakash
    Shwartz, Larisa
    Zhai, ChengXiang
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 1203 - 1212
  • [22] AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages
    Ebrahimi, Abteen
    Mager, Manuel
    Oncevay, Arturo
    Chaudhary, Vishrav
    Chiruzzo, Luis
    Fan, Angela
    Ortega, John E.
    Ramos, Ricardo
    Rios, Annette
    Meza-Ruiz, Ivan
    Gimenez-Lugo, Gustavo A.
    Mager, Elisabeth
    Neubig, Graham
    Palmer, Alexis
    Coto-Solano, Rolando
    Vu, Ngoc Thang
    Kann, Katharina
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 6279 - 6299
  • [23] MULTILINGUAL MLP FEATURES FOR LOW-RESOURCE LVCSR SYSTEMS
    Thomas, Samuel
    Ganapathy, Sriram
    Hermansky, Hynek
    2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 4269 - 4272
  • [24] Generalized Data Augmentation for Low-Resource Translation
    Xia, Mengzhou
    Kong, Xiang
    Anastasopoulos, Antonios
    Neubig, Graham
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 5786 - 5796
  • [25] ADVERSARIAL MULTILINGUAL TRAINING FOR LOW-RESOURCE SPEECH RECOGNITION
    Yi, Jiangyan
    Tao, Jianhua
    Wen, Zhengqi
    Bai, Ye
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4899 - 4903
  • [26] Transfer Learning for Low-Resource Multilingual Relation Classification
    Nag, Arijit
    Samanta, Bidisha
    Mukherjee, Animesh
    Ganguly, Niloy
    Chakrabarti, Soumen
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (02)
  • [27] Data Augmentation for Low-Resource Keyphrase Generation
    Garg, Krishna
    Chowdhury, Jishnu Ray
    Caragea, Cornelia
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 8442 - 8455
  • [28] Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion
    Farooq, Muhammad Umar
    Narayana, Darshan Adiga Haniya
    Hain, Thomas
    INTERSPEECH 2022, 2022, : 4850 - 4854
  • [29] Optimizing Multilingual Sentiment Analysis in Low-Resource Languages with Adaptive Pretraining and Strategic Language Selection
    Raychawdhary, Nilanjana
    Das, Amit
    Bhattacharya, Sutanu
    Dozier, Gerry
    Seals, Cheryl D.
    2024 IEEE 3RD INTERNATIONAL CONFERENCE ON COMPUTING AND MACHINE INTELLIGENCE, ICMI 2024, 2024,
  • [30] Adapting multilingual vision language transformers for low-resource Urdu optical character recognition (OCR)
    Cheema, Musa Dildar Ahmed
    Shaiq, Mohammad Daniyal
    Mirza, Farhaan
    Kamal, Ali
    Naeem, M. Asif
    PEERJ COMPUTER SCIENCE, 2024, 10