Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks

Cited by: 0
Authors
Nag, Arijit [1]
Samanta, Bidisha [1]
Mukherjee, Animesh [1]
Ganguly, Niloy [1]
Chakrabarti, Soumen [2]
Affiliations
[1] IIT Kharagpur, Kharagpur, West Bengal, India
[2] Indian Institute of Technology, Bombay, Maharashtra, India
Abstract
Multilingual language models (MLLMs) like mBERT promise to extend the benefits of NLP research to low-resource languages (LRLs). However, LRL words are under-represented in the wordpiece/subword vocabularies of MLLMs. As a result, many LRL words are replaced by UNK or assembled from morphologically unrelated wordpieces, which lowers task accuracy. (Pre-)training MLLMs after including LRL documents is resource-intensive in terms of both human effort and computation. In response, we propose EVALM (entropy-based vocabulary augmented language model), which uses a new task-cognizant measurement to detect the most vulnerable LRL words, those whose wordpiece segmentations are undesirable. EVALM then provides reasonable initializations of their embeddings, followed by limited fine-tuning on the small LRL task corpus. Our experiments show significant performance improvements, and also some surprising limits to such vocabulary augmentation strategies, in various classification tasks for multiple diverse LRLs as well as code-mixed texts. We will release the code and data to enable further research.
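The abstract does not spell out the entropy measurement or the embedding initialization, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes that a word is "vulnerable" when its own class-label distribution in the task corpus has much lower entropy than the label distributions of the wordpieces it is split into, and that a new word's embedding can be seeded with the mean of its wordpiece embeddings. All function names, the whitespace tokenization, and the `split_into_pieces` callback are hypothetical.

```python
import math
from collections import Counter, defaultdict


def label_entropy(counts):
    """Shannon entropy (in bits) of a label-count table."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def token_label_counts(examples, tokenize):
    """Count, per token, how many examples of each label contain it."""
    table = defaultdict(Counter)
    for text, label in examples:
        for token in set(tokenize(text)):
            table[token][label] += 1
    return table


def rank_vulnerable_words(examples, split_into_pieces, min_count=2):
    """Rank words by (mean wordpiece entropy - word entropy), largest first.

    Assumption: a large gap means the word is informative about the label
    but its wordpieces are not, so segmentation discards task signal.
    """
    word_counts = token_label_counts(examples, str.split)
    piece_counts = token_label_counts(
        examples,
        lambda text: [p for w in text.split() for p in split_into_pieces(w)],
    )
    scored = []
    for word, counts in word_counts.items():
        pieces = split_into_pieces(word)
        # Skip rare words and words that are already a single vocabulary item.
        if sum(counts.values()) < min_count or len(pieces) < 2:
            continue
        piece_entropy = sum(label_entropy(piece_counts[p])
                            for p in pieces) / len(pieces)
        scored.append((piece_entropy - label_entropy(counts), word))
    return sorted(scored, reverse=True)


def mean_init(pieces, embeddings):
    """Seed a new word embedding with the mean of its wordpiece vectors."""
    vectors = [embeddings[p] for p in pieces]
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

In this sketch the top-ranked words would be the candidates for addition to the vocabulary, with `mean_init` supplying a starting vector before fine-tuning on the task corpus. The measure actually used by EVALM may differ in its details.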
Pages: 8619-8629
Page count: 11