A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models

Cited by: 0
Authors
Qarah, Faisal [1 ]
Alsanoosy, Tawfeeq [1 ]
Affiliations
[1] Taibah Univ, Coll Comp Sci & Engn, Dept Comp Sci, Madinah 42353, Saudi Arabia
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 13
Keywords
large language models; BERT; Arabic language; natural language processing; tokenizer; distributed computing; NLP applications
DOI
10.3390/app14135696
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks owing to their capacity to capture deep contextualized information in text through pretraining on large-scale corpora. Tokenization plays a significant role in lexical analysis, and the resulting tokens serve as the input to downstream natural language processing (NLP) tasks such as semantic parsing and language modeling. However, little research has evaluated the impact of tokenization on Arabic language models. This study addresses that gap by evaluating the performance of various tokenizers on Arabic large language models (LLMs). Specifically, we analyze the differences between the WordPiece, SentencePiece, and byte-level BPE (BBPE) tokenizers by pretraining three BERT models, one with each tokenizer, and measuring the performance of each model on seven NLP tasks across 29 datasets. Overall, the model pretrained on text tokenized with SentencePiece significantly outperforms the two models that use the WordPiece and BBPE tokenizers. These results can assist researchers in developing better models, selecting the most suitable tokenizer, improving feature engineering, and building more efficient models, ultimately advancing a range of NLP applications.
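For readers who want a hands-on feel for the comparison, the following is a minimal sketch, not the authors' code, that trains the three tokenizer families on a toy Arabic corpus using the Hugging Face `tokenizers` library. The corpus, vocabulary size, and special tokens are illustrative assumptions, and SentencePiece is approximated here by the Unigram algorithm it implements by default.

```python
# Illustrative sketch only: the paper pretrains full BERT models on
# large-scale Arabic corpora; this toy example just contrasts the
# segmentations the three tokenizer families produce.
# Requires: pip install tokenizers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus (an assumption; the paper's pretraining corpora are far larger).
corpus = [
    "اللغة العربية لغة غنية بالصرف والاشتقاق.",
    "تلعب عملية التقطيع دورا مهما في معالجة اللغات الطبيعية.",
]

# WordPiece, the tokenizer used by the original BERT.
wp = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp.pre_tokenizer = pre_tokenizers.Whitespace()
wp.train_from_iterator(
    corpus, trainers.WordPieceTrainer(vocab_size=500, special_tokens=["[UNK]"])
)

# Unigram LM with Metaspace pre-tokenization, approximating SentencePiece's
# default behavior (spaces become the "▁" meta-symbol).
sp = Tokenizer(models.Unigram())
sp.pre_tokenizer = pre_tokenizers.Metaspace()
sp.train_from_iterator(
    corpus,
    trainers.UnigramTrainer(
        vocab_size=500, special_tokens=["<unk>"], unk_token="<unk>"
    ),
)

# Byte-level BPE (BBPE), as popularized by GPT-2/RoBERTa: text is handled as
# raw bytes, so no character is ever out of vocabulary.
bbpe = Tokenizer(models.BPE())
bbpe.pre_tokenizer = pre_tokenizers.ByteLevel()
bbpe.train_from_iterator(
    corpus,
    trainers.BpeTrainer(
        vocab_size=500, initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
    ),
)

for name, tok in [("WordPiece", wp), ("SentencePiece/Unigram", sp), ("BBPE", bbpe)]:
    print(name, tok.encode(corpus[0]).tokens)
```

Printing the token sequences side by side makes the practical differences visible: WordPiece marks word-internal pieces with "##", the Unigram model tends to keep longer, morpheme-like subwords, and BBPE falls back to bytes, so a single Arabic character can be split into multi-byte fragments.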
Pages: 17