A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models

Cited by: 0
Authors
Qarah, Faisal [1 ]
Alsanoosy, Tawfeeq [1 ]
Affiliations
[1] Taibah Univ, Coll Comp Sci & Engn, Dept Comp Sci, Madinah 42353, Saudi Arabia
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Iss. 13
Keywords
large language models; BERT; Arabic language; natural language processing; tokenizer; distributed computing; NLP applications;
DOI
10.3390/app14135696
Chinese Library Classification (CLC)
O6 [Chemistry];
Discipline Code
0703;
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks thanks to their capacity to capture deep contextualized information by pretraining on large-scale corpora. Tokenization plays a significant role in lexical analysis: tokens become the input for downstream natural language processing (NLP) tasks, such as semantic parsing and language modeling. However, little research has evaluated the impact of tokenization on Arabic language models. This study addresses that gap by evaluating the performance of different tokenizers on Arabic large language models (LLMs). Specifically, we analyze the differences between the WordPiece, SentencePiece, and BBPE tokenizers by pretraining three BERT models, one with each tokenizer, and measuring each model's performance on seven NLP tasks using 29 datasets. Overall, the model pretrained on text tokenized with SentencePiece significantly outperforms the two models that use the WordPiece and BBPE tokenizers. These results will help researchers develop better models, select the most suitable tokenizer, improve feature engineering, and build more efficient models, ultimately advancing various NLP applications.
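For readers unfamiliar with the three tokenizer families the abstract compares, the sketch below shows how each can be trained side by side with the Hugging Face `tokenizers` library. This is an illustrative assumption, not the authors' setup: the corpus file name, vocabulary size, and sample sentence are hypothetical, and the paper may have used the original `sentencepiece` implementation rather than the Unigram approximation shown here.

# Hypothetical sketch: training the three tokenizer families compared in the
# paper (WordPiece, SentencePiece-style Unigram, and byte-level BPE) with the
# Hugging Face `tokenizers` library. The file name, vocabulary size, and the
# sample sentence are illustrative assumptions, not the authors' actual setup.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["arabic_corpus.txt"]  # hypothetical pretraining corpus
specials = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]  # BERT special tokens

# 1) WordPiece, the tokenizer of the original BERT.
wordpiece = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wordpiece.pre_tokenizer = pre_tokenizers.Whitespace()
wordpiece.train(corpus, trainers.WordPieceTrainer(
    vocab_size=30000, special_tokens=specials))

# 2) A SentencePiece-style Unigram model (approximated here; the paper may
#    have used the original `sentencepiece` package instead).
unigram = Tokenizer(models.Unigram())
unigram.train(corpus, trainers.UnigramTrainer(
    vocab_size=30000, unk_token="[UNK]", special_tokens=specials))

# 3) Byte-level BPE (BBPE), as popularized by GPT-2/RoBERTa.
bbpe = Tokenizer(models.BPE())
bbpe.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
bbpe.train(corpus, trainers.BpeTrainer(
    vocab_size=30000, special_tokens=specials))

# Compare how each tokenizer segments the same Arabic sentence.
sample = "تلعب النماذج اللغوية دورا مهما في معالجة اللغة العربية"
for name, tok in [("WordPiece", wordpiece), ("Unigram", unigram), ("BBPE", bbpe)]:
    print(name, tok.encode(sample).tokens)

Run on a real corpus, this makes the practical trade-off visible: WordPiece and Unigram fall back to [UNK] for characters outside their vocabulary, while byte-level BPE can encode any input at the cost of longer, less morpheme-aligned token sequences for Arabic script.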
Pages: 17
Related Papers
50 records in total
  • [31] A comprehensive evaluation of large language models in mining gene relations and pathway knowledge
    Azam, Muhammad
    Chen, Yibo
    Arowolo, Micheal Olaolu
    Liu, Haowang
    Popescu, Mihail
    Xu, Dong
    QUANTITATIVE BIOLOGY, 2024, 12 (04) : 360 - 374
  • [33] Large language models for cyber resilience: A comprehensive review, challenges, and future perspectives
    Ding, Weiping
    Abdel-Basset, Mohamed
    Ali, Ahmed M.
    Moustafa, Nour
    APPLIED SOFT COMPUTING, 2025, 170
  • [34] A comprehensive survey of Arabic sentiment analysis
    Al-Ayyoub, Mahmoud
    Khamaiseh, Abed Allah
    Jararweh, Yaser
    Al-Kabi, Mohammed N.
    INFORMATION PROCESSING & MANAGEMENT, 2019, 56 (02) : 320 - 342
  • [35] Large Language Models are Not Models of Natural Language: They are Corpus Models
    Veres, Csaba
    IEEE ACCESS, 2022, 10 : 61970 - 61979
  • [36] Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study
    Wu, Yuepeng
    Zhang, Yukang
    Xu, Mei
    Chen, Jinzhi
    Xue, Yican
    Zheng, Yuchen
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 25 (1)
  • [37] Large Language Models in Targeted Sentiment Analysis for Russian
    Rusnachenko, N.
    Golubev, A.
    Loukachevitch, N.
    LOBACHEVSKII JOURNAL OF MATHEMATICS, 2024, 45 (7) : 3148 - 3158
  • [38] Can Large Language Models Assist in Hazard Analysis?
    Diemert, Simon
    Weber, Jens H.
    COMPUTER SAFETY, RELIABILITY, AND SECURITY, SAFECOMP 2023 WORKSHOPS, 2023, 14182 : 410 - 422
  • [39] Leveraging Large Language Models for Automated Dialogue Analysis
    Finch, Sarah E.
    Paek, Ellie S.
    Choi, Jinho D.
    24TH MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE, SIGDIAL 2023, 2023, : 202 - 215
  • [40] An analysis of large language models: their impact and potential applications
    Bharathi Mohan, G.
    Prasanna Kumar, R.
    Vishal Krishh, P.
    Keerthinathan, A.
    Lavanya, G.
    Meghana, Meka Kavya Uma
    Sulthana, Sheba
    Doss, Srinath
    KNOWLEDGE AND INFORMATION SYSTEMS, 2024, 66 (09) : 5047 - 5070