Investigating the impact of pretraining corpora on the performance of Arabic BERT models

Cited by: 0
Authors
Alammary, Ali Saleh [1 ]
Affiliation
[1] College of Computing and Informatics, Saudi Electronic University, Jeddah, Saudi Arabia
Source
Journal of Supercomputing | 2025, Vol. 81, Issue 1
Keywords
Natural language processing systems
DOI
10.1007/s11227-024-06698-2
Abstract
Bidirectional Encoder Representations from Transformers (BERT), a revolutionary model in natural language processing (NLP), has significantly impacted text-related tasks, including text classification. Several BERT models have been developed for the Arabic language. While many studies have compared their overall performance on text classification tasks, none has investigated the relationship between their pretraining data and their performance. This study examines that relationship by evaluating ten models on eight diverse classification tasks using metrics such as accuracy and F1 score. The results revealed variations in performance across tasks that were attributable mainly to the models' pretraining corpora. The study emphasizes the impact of pretraining data size, quality, and diversity on model adaptability: models pretrained on specific corpora, even larger ones, may not outperform models pretrained on more diverse datasets. Notably, domain-specific tasks, such as medical and poetry classification, revealed performance gaps relative to the original English BERT. The findings suggest the need to reevaluate the pretraining approach for Arabic BERT models, with balancing quantity and quality in pretraining corpora spanning various domains identified as crucial. The study provides insights into optimizing pretraining strategies to enhance the performance and adaptability of Arabic BERT models across diverse text classification tasks, offering valuable guidance for researchers and practitioners in NLP. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
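To make the evaluation setup concrete, the following is a minimal sketch (not the paper's actual code) of how an Arabic BERT checkpoint can be scored on a text classification task with accuracy and macro-F1, using the Hugging Face transformers library and scikit-learn. The model identifier, label set, and toy texts are illustrative assumptions; in practice each checkpoint would first be fine-tuned on the task's training split before being evaluated this way.

    # Hedged sketch: score an Arabic BERT classifier with accuracy and macro-F1.
    # Model ID, labels, and texts are placeholders, not the study's data.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from sklearn.metrics import accuracy_score, f1_score

    MODEL_ID = "aubmindlab/bert-base-arabertv02"  # example Arabic BERT checkpoint (assumed)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # In a real evaluation this would be a checkpoint already fine-tuned for the task;
    # loading the base model here only illustrates the scoring pipeline.
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
    model.eval()

    texts = ["نص عربي للتصنيف", "مثال آخر للتقييم"]  # toy evaluation texts
    labels = [1, 0]                                   # toy gold labels

    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        preds = model(**batch).logits.argmax(dim=-1).tolist()

    print("accuracy:", accuracy_score(labels, preds))
    print("macro-F1:", f1_score(labels, preds, average="macro"))

The same loop, repeated over the ten models and eight tasks the abstract mentions, would yield the per-task accuracy and F1 comparisons that the study analyzes against each model's pretraining corpus.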
Related papers
50 records in total
  • [1] Lamproudis, Anastasios; Henriksson, Aron; Dalianis, Hercules. Evaluating Pretraining Strategies for Clinical BERT Models. LREC 2022: Thirteenth International Conference on Language Resources and Evaluation, 2022: 410-416.
  • [2] Souza, F. C.; Nogueira, R. F.; Lotufo, R. A. BERT models for Brazilian Portuguese: Pretraining, evaluation and tokenization analysis. Applied Soft Computing, 2023, 149.
  • [3] Jin, Xisen; Zhang, Dejiao; Zhu, Henghui; Xiao, Wei; Li, Shang-Wen; Wei, Xiaokai; Arnold, Andrew; Ren, Xiang. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora. NAACL 2022: The 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022: 4764-4780.
  • [4] Jin, Xisen; Zhang, Dejiao; Zhu, Henghui; Xiao, Wei; Li, Shang-Wen; Wei, Xiaokai; Arnold, Andrew; Ren, Xiang. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora. Proceedings of Workshop on Challenges & Perspectives in Creating Large Language Models (BigScience Episode #5), 2022: 1-16.
  • [5] Nazih, Waleed; Hifny, Yasser. Arabic Syntactic Diacritics Restoration Using BERT Models. Computational Intelligence and Neuroscience, 2022, 2022.
  • [6] Chouikhi, Hasna; Chniter, Hamza; Jarray, Fethi. Stacking BERT based Models for Arabic Sentiment Analysis. Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KEOD), Vol. 2, 2021: 144-150.
  • [7] Alammary, Ali Saleh. BERT Models for Arabic Text Classification: A Systematic Review. Applied Sciences-Basel, 2022, 12 (11).
  • [8] Obeidat, Rasha; Bashayreh, Amjad; Younis, Lojin Bani. The Impact of Combining Arabic Sarcasm Detection Datasets on the Performance of BERT-based Model. 2022 13th International Conference on Information and Communication Systems (ICICS), 2022: 22-29.
  • [9] Saidi, Rakia; Jarray, Fethi. Stacking of BERT and CNN Models for Arabic Word Sense Disambiguation. ACM Transactions on Asian and Low-Resource Language Information Processing, 2023, 22 (11).
  • [10] Aleryani, Ghalyah; Deabes, Wael; Albishre, Khaled; Abdel-Hakim, Alaa E. Impact of Emojis Exclusion on the Performance of Arabic Sarcasm Detection Models. International Journal of Advanced Computer Science and Applications, 2024, 15 (08): 1315-1322.