LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish

Cited by: 0
Authors
Lothritz, Cedric [1 ]
Lebichot, Bertrand [1 ]
Allix, Kevin [1 ]
Veiber, Lisa [1 ]
Bissyande, Tegawende F. [1 ]
Klein, Jacques [1 ]
Boytsov, Andrey [2 ]
Goujon, Anne [2 ]
Lefebvre, Clement [2 ]
Affiliations
[1] Univ Luxembourg, 16 Rue Richard Coudenhove Kalergi, L-1359 Luxembourg, Luxembourg
[2] Banque BGL BNP Paribas, 50 Ave JF Kennedy, L-2951 Luxembourg, Luxembourg
Keywords
Language Models; Less-Resourced Languages; NLP Datasets
DOI
Not available
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Pre-trained language models such as BERT have become ubiquitous in NLP, where they achieve state-of-the-art performance on most tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language, which we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well as the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
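As a concrete illustration of the augmentation idea, the sketch below performs word-level partial translation of source-language sentences using a bilingual lexicon. The choice of German as the closely related language, the dictionary-lookup strategy, and all lexicon entries are illustrative assumptions; the abstract itself does not spell out the procedure.

    # Minimal sketch of dictionary-based partial translation for
    # pre-training data augmentation. Assumed details (not stated in the
    # abstract): German as the source language and a word-level
    # German -> Luxembourgish lexicon.

    def partially_translate(sentence: str, lexicon: dict) -> str:
        """Replace each word that has a lexicon entry; leave the rest as-is."""
        tokens = sentence.split()
        return " ".join(lexicon.get(tok.lower(), tok) for tok in tokens)

    # Toy lexicon with illustrative entries only.
    lexicon = {
        "ich": "ech",    # I
        "habe": "hunn",  # have
        "nicht": "net",  # not
        "und": "an",     # and
    }

    source = "Ich habe das Buch nicht gelesen"
    print(partially_translate(source, lexicon))
    # -> "ech hunn das Buch net gelesen"

Under these assumptions, the partially translated sentences would be mixed with the genuine Luxembourgish corpus before pre-training, enlarging the dataset while keeping it close to the target language.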
Pages: 5080-5089 (10 pages)