LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish

Times Cited: 0
Authors
Lothritz, Cedric [1 ]
Lebichot, Bertrand [1 ]
Allix, Kevin [1 ]
Veiber, Lisa [1 ]
Bissyande, Tegawende F. [1 ]
Klein, Jacques [1 ]
Boytsov, Andrey [2 ]
Goujon, Anne [2 ]
Lefebvre, Clement [2 ]
Affiliations
[1] Univ Luxembourg, 16 Rue Richard Coudenhove Kalergi, L-1359 Luxembourg, Luxembourg
[2] Banque BGL BNP Paribas, 50 Ave JF Kennedy, L-2951 Luxembourg, Luxembourg
Keywords
Language Models; Less-Resourced Languages; NLP Datasets
DOI
Not available
Chinese Library Classification (CLC) Code
TP39 [Computer Applications]
Discipline Classification Codes
081203; 0835
Abstract
Pre-trained language models such as BERT have become ubiquitous in NLP, where they achieve state-of-the-art performance on most tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language, which we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms both a simple baseline built with the available Luxembourgish text data and the multilingual mBERT model, which is currently the only other option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
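The abstract describes the augmentation only at a high level. The minimal sketch below illustrates one plausible reading of such a scheme: word-level, dictionary-based partial translation of sentences from a closely related language (for instance German) into pseudo-Luxembourgish, keeping out-of-dictionary words unchanged. The lexicon format, the whitespace tokenization, the coverage threshold, and all function names are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of dictionary-based partial translation for pre-training data
# augmentation. The TSV lexicon format, the coverage threshold, and all names
# are illustrative assumptions, not the authors' implementation.

def load_lexicon(path):
    """Read a two-column TSV file (source_word<TAB>target_word) into a dict."""
    lexicon = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            src, tgt = line.rstrip("\n").split("\t")
            lexicon[src.lower()] = tgt
    return lexicon


def partially_translate(sentence, lexicon):
    """Replace every word found in the lexicon; keep unknown words as they are."""
    tokens = sentence.split()          # naive whitespace tokenization
    out, hits = [], 0
    for tok in tokens:
        tgt = lexicon.get(tok.lower())
        if tgt is not None:
            out.append(tgt)
            hits += 1
        else:
            out.append(tok)
    coverage = hits / max(len(tokens), 1)
    return " ".join(out), coverage


def augment(sentences, lexicon, min_coverage=0.5):
    """Yield partially translated sentences whose lexicon coverage is high enough."""
    for sent in sentences:
        translated, coverage = partially_translate(sent, lexicon)
        if coverage >= min_coverage:
            yield translated


if __name__ == "__main__":
    # Toy German -> Luxembourgish lexicon (hypothetical entries, lowercase keys).
    lex = {"ich": "ech", "habe": "hunn", "ein": "en", "buch": "Buch", "gelesen": "gelies"}
    for sent in augment(["Ich habe ein Buch gelesen"], lex):
        print(sent)  # -> "ech hunn en Buch gelies"
```

Keeping unknown words as-is preserves the word order and most function words of the source sentence, which is why starting from a closely related language can keep such pseudo-Luxembourgish text usable for pre-training.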
Pages: 5080-5089
Number of Pages: 10
Related Papers
50 records in total
  • [1] SAS: Self-Augmentation Strategy for Language Model Pre-training
    Xu, Yifei
    Zhang, Jingqiao
    He, Ru
    Ge, Liangzhu
    Yang, Chao
    Yang, Cheng
    Wu, Ying Nian
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 11586 - 11594
  • [2] Soft Language Clustering for Multilingual Model Pre-training
    Zeng, Jiali
    Jiang, Yufan
    Yin, Yongjing
    Jing, Yi
    Meng, Fandong
    Lin, Binghuai
    Cao, Yunbo
    Zhou, Jie
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 7021 - 7035
  • [3] FlauBERT: Unsupervised Language Model Pre-training for French
    Le, Hang
    Vial, Loic
    Frej, Jibril
    Segonne, Vincent
    Coavoux, Maximin
    Lecouteux, Benjamin
    Allauzen, Alexandre
    Crabbe, Benoit
    Besacier, Laurent
    Schwab, Didier
    [J]. PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2479 - 2490
  • [4] Early Rumor Detection based on Data Augmentation and Pre-training Transformer
    Hu, Yanjun
    Ju, Xinyi
    Ye, Zhousheng
    Khan, Sulaiman
    Yuan, Chengwu
    Lai, Qiran
    Liu, Junqiang
    [J]. 2022 IEEE 12TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE (CCWC), 2022, : 152 - 158
  • [5] ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
    Wang, Weihan
    Yang, Zhen
    Xu, Bin
    Li, Juanzi
    Sun, Yankui
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 3135 - 3146
  • [6] Unified Language Model Pre-training for Natural Language Understanding and Generation
    Dong, Li
    Yang, Nan
    Wang, Wenhui
    Wei, Furu
    Liu, Xiaodong
    Wang, Yu
    Gao, Jianfeng
    Zhou, Ming
    Hon, Hsiao-Wuen
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [7] Simultaneously Training and Compressing Vision-and-Language Pre-Training Model
    Qi, Qiaosong
    Zhang, Aixi
    Liao, Yue
    Sun, Wenyu
    Wang, Yongliang
    Li, Xiaobo
    Liu, Si
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8194 - 8203
  • [8] On the importance of pre-training data volume for compact language models
    Micheli, Vincent
    D'Hoffschmidt, Martin
    Fleuret, Francois
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 7853 - 7858
  • [9] Conditional Embedding Pre-Training Language Model for Image Captioning
    Li, Pengfei
    Zhang, Min
    Lin, Peijie
    Wan, Jian
    Jiang, Ming
    [J]. NEURAL PROCESSING LETTERS, 2022, 54 (06) : 4987 - 5003
  • [10] Pre-training A Prompt Pool for Vision-Language Model
    Liu, Jun
    Gu, Yang
    Yang, Zhaohua
    Guo, Shuai
    Liu, Huaqiu
    Chen, Yiqiang
    [J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023