In the heart of Swahili: An exploration of data collection methods and corpus curation for natural language processing

Cited by: 0

Authors:

Masua, Bernard [1]

Masasi, Noel [1]

Affiliations:

[1] Univ Dar Es Salaam, Coll Informat & Commun Technol CoICT, Ali Hassan Mwinyi Rd,Kijitonyama Campus, TZ-33335 Dar Es Salaam, Tanzania

Source:

DATA IN BRIEF | 2024 / Vol. 55

Keywords:

Text pre-processing; Swahili language; Corpus; Machine learning;

DOI:

10.1016/j.dib.2024.110751

Chinese Library Classification (CLC) codes:

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

The Swahili corpus is a dataset generated by collecting written Kiswahili sentences from different sectors that deal with Kiswahili documents. A corpus of the target language is needed in Natural Language Processing (NLP) tasks so that an algorithm can be fitted to that language before a model is trained. The generated Swahili corpus contains 1,693,228 sentences with 39,639,824 words and 871,452 unique words. The corpus was exported in text file format with a storage size of 168 MB. The sentences were collected from different sources in the following categories: Health (AFYA), Business and Industries (BIASHARA), Parliament (BUNGE), Religion (DINI), Education (ELIMU), News (HABARI), Agriculture (KILIMO), Social Media (MITANDAO), Non-Governmental Organizations (MASHIRIKA YA KIRAIA), Government (SERIKALI), Laws (SHERIA) and Politics (SIASA). This abstract outlines the systematic data collection process employed to create a Swahili corpus derived from multiple public websites and reports. The compilation of this corpus involved a meticulous and comprehensive approach to ensure the representation of diverse linguistic contexts and topics relevant to the Swahili language. The data collection process commenced with the identification of suitable sources across various domains, including news articles, health publications, online forums, and government public reports. Websites and platforms with publicly available Swahili content were systematically crawled and archived to capture a broad spectrum of linguistic expressions. Furthermore, special attention was given to reputable sources to maintain the authenticity and linguistic richness of the corpus. The inclusion of diverse sources ensures that the corpus reflects the linguistic nuances inherent in different contexts and registers within the Swahili language. Additionally, efforts were made to incorporate variations in domain dialects, acknowledging the linguistic diversity present in Swahili.
The potential for reusing this Swahili corpus is vast. Researchers, linguists, and language enthusiasts can leverage the diverse and extensive dataset for a multitude of applications, including NLP tasks such as sentiment analysis, textual data clustering, classification tasks and machine translation. The corpus can serve as training data for developing and evaluating NLP algorithms, including part-of-speech tagging and named entity recognition. Text mining techniques can also be applied to the corpus, enabling researchers to extract valuable insights, identify patterns, and discover knowledge from large textual datasets. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC license ( http://creativecommons.org/licenses/by-nc/4.0/ )
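
The abstract reports the corpus size as sentence, word, and unique-word counts over a plain text file. As a rough illustration of how such figures could be computed, the following is a minimal sketch and not the authors' procedure. It assumes one sentence per line and a simple regular-expression tokenizer that keeps word-internal apostrophes and hyphens (as in "ng'ombe"); the function name `corpus_stats` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters/digits, optionally joined by
# word-internal apostrophes or hyphens (e.g. "ng'ombe").
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*")


def corpus_stats(lines):
    """Count sentences, word tokens, and unique word types.

    Assumes each non-empty line holds exactly one sentence.
    Accepts any iterable of strings, including an open text file.
    """
    sentences = 0
    counts = Counter()
    for line in lines:
        tokens = TOKEN_RE.findall(line.lower())
        if not tokens:
            continue  # skip blank or punctuation-only lines
        sentences += 1
        counts.update(tokens)
    return {
        "sentences": sentences,
        "words": sum(counts.values()),
        "unique_words": len(counts),
    }


# Hypothetical sample lines, used only to exercise the function.
sample = [
    "Habari ya asubuhi",
    "",
    "Elimu ni ufunguo wa maisha",
    "Habari ya leo",
]
stats = corpus_stats(sample)
print(stats)
```

Because the function iterates line by line, a file of the reported size (168 MB) could be passed in as an open file handle without loading it all into memory. The resulting counts depend on the tokenization and case-folding choices, so they would not necessarily match the figures reported in the abstract.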

Pages: 9
