A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models

Cited by: 0
Authors
Qarah, Faisal [1 ]
Alsanoosy, Tawfeeq [1 ]
Affiliation
[1] Taibah Univ, Coll Comp Sci & Engn, Dept Comp Sci, Madinah 42353, Saudi Arabia
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 13
Keywords
large language models; BERT; Arabic language; natural language processing; tokenizer; distributed computing; NLP applications;
DOI
10.3390/app14135696
CLC Number
O6 [Chemistry];
Subject Classification Code
0703;
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text through pretraining on large-scale corpora. Tokenization plays a significant role in lexical analysis: the resulting tokens serve as input to downstream natural language processing (NLP) tasks such as semantic parsing and language modeling. However, there is little research evaluating the impact of tokenization on Arabic language models. This study aims to address this gap in the literature by evaluating the performance of various tokenizers for Arabic large language models (LLMs). In this paper, we analyze the differences between the WordPiece, SentencePiece, and byte-level BPE (BBPE) tokenizers by pretraining three BERT models, one per tokenizer, and measuring each model's performance on seven different NLP tasks using 29 different datasets. Overall, the model pretrained on text tokenized with SentencePiece significantly outperforms the other two models, which use the WordPiece and BBPE tokenizers. The results of this paper will assist researchers in developing better models, selecting the most suitable tokenizer, improving feature engineering, and making models more efficient, ultimately leading to advancements in various NLP applications.
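The three tokenizer families compared in the abstract differ in their basic unit and in how they mark word boundaries. The sketch below illustrates those conventions on a single Arabic word; the subword splits shown are hypothetical illustrations, not output of the trained vocabularies used in the paper.

```python
# A minimal sketch of the surface conventions of the three tokenizer
# families compared in the paper. The splits below are hypothetical
# examples, not trained-vocabulary output.

word = "اللغة"  # Arabic for "the language" (5 characters)

# BBPE (byte-level BPE) operates on UTF-8 bytes, so each Arabic
# character first becomes 2 bytes before any merges are learned.
bbpe_units = list(word.encode("utf-8"))
print(len(word), len(bbpe_units))  # 5 10

# WordPiece marks word-internal continuation pieces with a "##" prefix.
wordpiece_style = ["ال", "##لغة"]          # hypothetical split
# SentencePiece marks word-initial pieces with "▁" (U+2581) and treats
# whitespace as part of the token stream, so no pre-tokenization step
# that splits on spaces is required.
sentencepiece_style = ["▁ال", "لغة"]       # hypothetical split

# Detokenization rules differ accordingly: strip "##" markers for
# WordPiece; replace "▁" with a space (here: nothing, single word).
assert "".join(p.lstrip("#") for p in wordpiece_style) == word
assert "".join(sentencepiece_style).replace("▁", "") == word
```

The byte-level view shows one practical consequence for Arabic: because every Arabic character costs two bytes in UTF-8, a byte-level vocabulary starts from sequences twice as long as the character count, which may partly explain performance differences between BBPE and the character-aware tokenizers.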
Pages: 17
Related Papers
(50 in total)
  • [41] Trend Extraction and Analysis via Large Language Models
    Soru, Tommaso
    Marshall, Jim
    18TH IEEE INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING, ICSC 2024, 2024, : 285 - 288
  • [42] Analysis of Privacy Leakage in Federated Large Language Models
    Vu, Minh N.
    Nguyen, Truc
    Jeter, Tre' R.
    Thai, My T.
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [43] Large language models in textual analysis for gesture selection
    Hensel, Laura B.
    Yongsatianchot, Nutchanon
    Torshizi, Parisa
    Minucci, Elena
    Marsella, Stacy
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023, : 378 - 387
  • [44] Large Language Models
    Vargas, Diego Collarana
    Katsamanis, Nassos
ERCIM NEWS, 2024, (136): 12 - 13
  • [45] Large Language Models
    Cerf, Vinton G.
    COMMUNICATIONS OF THE ACM, 2023, 66 (08) : 7 - 7
  • [46] Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
    Shang, Guokan
    Abdine, Hadi
    Khoubrane, Yousef
    Mohamed, Amr
    Abbahaddou, Yassine
    Ennadir, Sofiane
    Momayiz, Imane
    Ren, Xuguang
    Moulines, Eric
    Nakov, Preslav
    Vazirgiannis, Michalis
    Xing, Eric
    arXiv,
  • [47] Opinion Mining and Analysis for Arabic Language
    Al-Kabi, Mohammed N.
    Gigieh, Amal H.
    Alsmadi, Izzat M.
    Wahsheh, Heider A.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2014, 5 (05) : 181 - 195
  • [48] Optimizing Large Language Models for Arabic Healthcare Communication: A Focus on Patient-Centered NLP Applications
    Mohammad, Rasheed
    Alkhnbashi, Omer S.
    Hammoudeh, Mohammad
BIG DATA AND COGNITIVE COMPUTING, 2024, 8 (11)
  • [49] Error Analysis of Pretrained Language Models (PLMs) in English-to-Arabic Machine Translation
    Al-Khalifa, Hend
    Al-Khalefah, Khaloud
    Haroon, Hesham
    HUMAN-CENTRIC INTELLIGENT SYSTEMS, 2024, 4 (2): 206 - 219
  • [50] Item Analysis Of Arabic Language Examination
    Mahmudi, Ihwan
    Nurwardah, Afni
    Rochma, Siti Nikmatul
    Nurcholis, Agung
    IJAZ ARABI JOURNAL OF ARABIC LEARNING, 2023, 6 (03): 563 - 573