A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models

Cited by: 0
Authors
Qarah, Faisal [1 ]
Alsanoosy, Tawfeeq [1 ]
Affiliations
[1] Taibah Univ, Coll Comp Sci & Engn, Dept Comp Sci, Madinah 42353, Saudi Arabia
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 13
Keywords
large language models; BERT; Arabic language; natural language processing; tokenizer; distributed computing; NLP applications
DOI
10.3390/app14135696
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks owing to their capacity to capture deep contextualized information in text through pretraining on large-scale corpora. Tokenization plays a significant role in lexical analysis, and the resulting tokens serve as the input to downstream natural language processing (NLP) tasks such as semantic parsing and language modeling. However, little research has evaluated the impact of tokenization on Arabic language models. This study addresses that gap by evaluating the performance of various tokenizers on Arabic large language models (LLMs). Specifically, we analyze the differences between the WordPiece, SentencePiece, and byte-level BPE (BBPE) tokenizers by pretraining three BERT models, one with each tokenizer, and measuring the performance of each model on seven NLP tasks across 29 datasets. Overall, the model pretrained on text tokenized with SentencePiece significantly outperforms the two models that use the WordPiece and BBPE tokenizers. These results can assist researchers in developing better models, selecting the most suitable tokenizer, improving feature engineering, and building more efficient models, ultimately advancing a range of NLP applications.
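For readers who want a hands-on feel for the comparison, the following is a minimal sketch, not the authors' code, that trains the three tokenizer families on a toy Arabic corpus using the Hugging Face `tokenizers` library. The corpus, vocabulary size, and special tokens are illustrative assumptions, and SentencePiece is approximated here by the Unigram algorithm it implements by default.

```python
# Illustrative sketch only: the paper pretrains full BERT models on
# large-scale Arabic corpora; this toy example just contrasts the
# segmentations the three tokenizer families produce.
# Requires: pip install tokenizers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus (an assumption; the paper's pretraining corpora are far larger).
corpus = [
    "اللغة العربية لغة غنية بالصرف والاشتقاق.",
    "تلعب عملية التقطيع دورا مهما في معالجة اللغات الطبيعية.",
]

# WordPiece, the tokenizer used by the original BERT.
wp = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp.pre_tokenizer = pre_tokenizers.Whitespace()
wp.train_from_iterator(
    corpus, trainers.WordPieceTrainer(vocab_size=500, special_tokens=["[UNK]"])
)

# Unigram LM with Metaspace pre-tokenization, approximating SentencePiece's
# default behavior (spaces become the "▁" meta-symbol).
sp = Tokenizer(models.Unigram())
sp.pre_tokenizer = pre_tokenizers.Metaspace()
sp.train_from_iterator(
    corpus,
    trainers.UnigramTrainer(
        vocab_size=500, special_tokens=["<unk>"], unk_token="<unk>"
    ),
)

# Byte-level BPE (BBPE), as popularized by GPT-2/RoBERTa: text is handled as
# raw bytes, so no character is ever out of vocabulary.
bbpe = Tokenizer(models.BPE())
bbpe.pre_tokenizer = pre_tokenizers.ByteLevel()
bbpe.train_from_iterator(
    corpus,
    trainers.BpeTrainer(
        vocab_size=500, initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
    ),
)

for name, tok in [("WordPiece", wp), ("SentencePiece/Unigram", sp), ("BBPE", bbpe)]:
    print(name, tok.encode(corpus[0]).tokens)
```

Printing the token sequences side by side makes the practical differences visible: WordPiece marks word-internal pieces with "##", the Unigram model tends to keep longer, morpheme-like subwords, and BBPE falls back to bytes, so a single Arabic character can be split into multi-byte fragments.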
Pages: 17