Investigating the impact of pretraining corpora on the performance of Arabic BERT models

Cited by: 0
Authors
Alammary, Ali Saleh [1 ]
Affiliations
[1] College of Computing and Informatics, Saudi Electronic University, Jeddah, Saudi Arabia
Source
Journal of Supercomputing | 2025, Vol. 81, No. 01
Keywords
Natural language processing systems;
DOI
10.1007/s11227-024-06698-2
Abstract
Bidirectional Encoder Representations from Transformers (BERT), a revolutionary model in natural language processing (NLP), has significantly impacted text-related tasks, including text classification. Several BERT models have been developed for the Arabic language. While many studies have compared their overall performance on text classification tasks, none has investigated the relationship between their pretraining data and their performance. This study investigates that relationship by evaluating ten models on eight diverse classification tasks using metrics such as accuracy and F1 score. The results revealed variations in performance across tasks, attributable mainly to the models' pretraining corpora. The study emphasizes the impact of pretraining data size, quality, and diversity on model adaptability: models pretrained on narrow corpora, despite their larger size, may not outperform those pretrained on more diverse datasets. Notably, domain-specific tasks, such as medical and poetry classification, revealed performance gaps compared to the original English BERT. The findings suggest the need to reevaluate the pretraining approach for Arabic BERT models; balancing quantity and quality in pretraining corpora spanning various domains is identified as crucial. The study provides insights into optimizing pretraining strategies for enhanced performance and adaptability of Arabic BERT models in diverse text classification tasks, offering valuable guidance for researchers and practitioners in NLP. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
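The abstract reports model comparisons using accuracy and F1 score. As a minimal illustration of how these two metrics differ (the labels below are hypothetical, not data from the study), the following sketch computes accuracy and macro-averaged F1 from scratch:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical gold labels and predictions for a 3-class classification task
y_true = ["pos", "neg", "neu", "pos", "neg", "pos"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "pos"]

acc = accuracy(y_true, y_pred)   # 4 of 6 correct
f1 = macro_f1(y_true, y_pred)    # penalized by the fully missed "neu" class
```

Because macro F1 averages over classes rather than examples, a model that ignores a rare class (here, "neu") can have a respectable accuracy but a much lower macro F1, which is why both metrics are typically reported for imbalanced classification tasks.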
Related Papers
50 items total
  • [31] Investigating the performance impact of caches: An experimental approach
    Ismail, Nabil A.
    AEJ - Alexandria Engineering Journal, 2002, 41 (04): : 683 - 701
  • [32] Investigating the impact of face categorization on recognition performance
    Veropoulos, K
    Bebis, G
    Webster, M
    ADVANCES IN VISUAL COMPUTING, PROCEEDINGS, 2005, 3804 : 207 - 218
  • [33] Investigating security investment impact on firm performance
    Bose, Ranjit
    Luo, Xin
    INTERNATIONAL JOURNAL OF ACCOUNTING AND INFORMATION MANAGEMENT, 2014, 22 (03) : 194 - +
  • [34] Investigating the impact of data normalization on classification performance
    Singh, Dalwinder
    Singh, Birmohan
    APPLIED SOFT COMPUTING, 2020, 97
  • [35] Investigating the Impact of Train/Test Split Ratio on the Performance of Pre-Trained Models with Custom Datasets
    Bichri, Houda
    Chergui, Adil
    Hain, Mustapha
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (02) : 331 - 339
  • [36] Investigating the Impact of Mobility Models on MANET Routing Protocols
    Abdullah, Ako Muhammad
    Ozen, Emre
    Bayramoglu, Husnu
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2019, 10 (02) : 25 - 35
  • [37] Investigating the performance of personalized models for software defect prediction
    Eken, Beyza
    Tosun, Ayse
    JOURNAL OF SYSTEMS AND SOFTWARE, 2021, 181
  • [38] IMPLI: Investigating NLI Models' Performance on Figurative Language
    Stowe, Kevin
    Utama, Prasetya Ajie
    Gurevych, Iryna
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 5375 - 5388
  • [39] Performance of Multiple Pretrained BERT Models to Automate and Accelerate Data Annotation for Large Datasets
    Tejani, Ali S.
    Ng, Yee S.
    Xi, Yin
    Fielding, Julia R.
    Browning, Travis G.
    Rayan, Jesse C.
    RADIOLOGY-ARTIFICIAL INTELLIGENCE, 2022, 4 (04)
  • [40] Exploring the performance and explainability of fine-tuned BERT models for neuroradiology protocol assignment
    Talebi, Salmonn
    Tong, Elizabeth
    Li, Anna
    Yamin, Ghiam
    Zaharchuk, Greg
    Mofrad, Mohammad R. K.
    BMC Medical Informatics and Decision Making, 24