LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish

Cited by: 0
Authors
Lothritz, Cedric [1 ]
Lebichot, Bertrand [1 ]
Allix, Kevin [1 ]
Veiber, Lisa [1 ]
Bissyande, Tegawende F. [1 ]
Klein, Jacques [1 ]
Boytsov, Andrey [2 ]
Goujon, Anne [2 ]
Lefebvre, Clement [2 ]
Affiliations
[1] Univ Luxembourg, 16 Rue Richard Coudenhove Kalergi, L-1359 Luxembourg, Luxembourg
[2] Banque BGL BNP Paribas, 50 Ave JF Kennedy, L-2951 Luxembourg, Luxembourg
Keywords
Language Models; Less-Resourced Languages; NLP Datasets
DOI
Not available
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Pre-trained language models such as BERT have become ubiquitous in NLP, where they achieve state-of-the-art performance on most tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language, which we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well as the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
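As a concrete illustration of the augmentation idea, the sketch below performs word-level partial translation of source-language sentences using a bilingual lexicon. The choice of German as the closely related language, the dictionary-lookup strategy, and all lexicon entries are illustrative assumptions; the abstract itself does not spell out the procedure.

    # Minimal sketch of dictionary-based partial translation for
    # pre-training data augmentation. Assumed details (not stated in the
    # abstract): German as the source language and a word-level
    # German -> Luxembourgish lexicon.

    def partially_translate(sentence: str, lexicon: dict) -> str:
        """Replace each word that has a lexicon entry; leave the rest as-is."""
        tokens = sentence.split()
        return " ".join(lexicon.get(tok.lower(), tok) for tok in tokens)

    # Toy lexicon with illustrative entries only.
    lexicon = {
        "ich": "ech",    # I
        "habe": "hunn",  # have
        "nicht": "net",  # not
        "und": "an",     # and
    }

    source = "Ich habe das Buch nicht gelesen"
    print(partially_translate(source, lexicon))
    # -> "ech hunn das Buch net gelesen"

Under these assumptions, the partially translated sentences would be mixed with the genuine Luxembourgish corpus before pre-training, enlarging the dataset while keeping it close to the target language.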
Pages: 5080-5089 (10 pages)