In the heart of Swahili: An exploration of data collection methods and corpus curation for natural language processing

Authors
Masua, Bernard [1]
Masasi, Noel [1]
Affiliations
[1] Univ Dar Es Salaam, Coll Informat & Commun Technol CoICT, Ali Hassan Mwinyi Rd,Kijitonyama Campus, TZ-33335 Dar Es Salaam, Tanzania
Source
DATA IN BRIEF | 2024 / Vol. 55
Keywords
Text pre-processing; Swahili language; Corpus; Machine learning;
DOI
10.1016/j.dib.2024.110751
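Chinese Library Classification codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes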
07 ; 0710 ; 09 ;
Abstract
The Swahili corpus is a dataset generated by collecting written Kiswahili sentences from different sectors that deal with Kiswahili documents. A corpus of the target language is needed in Natural Language Processing (NLP) tasks so that algorithms can be fitted to that language before a model is trained. The generated Swahili corpus contains 1,693,228 sentences with 39,639,824 words and 871,452 unique words. The corpus is exported in text file format with a storage size of 168 MB. The sentences were collected from different sources in the following categories: Health (AFYA), Business and Industries (BIASHARA), Parliament (BUNGE), Religion (DINI), Education (ELIMU), News (HABARI), Agriculture (KILIMO), Social Media (MITANDAO), Non-Governmental Organizations (MASHIRIKA YA KIRAIA), Government (SERIKALI), Laws (SHERIA) and Politics (SIASA). This abstract outlines the systematic data collection process employed to create a Swahili corpus derived from multiple public websites and reports. Compiling the corpus involved a meticulous and comprehensive approach to ensure the representation of diverse linguistic contexts and topics relevant to the Swahili language. The data collection process began with the identification of suitable sources across various domains, including news articles, health publications, online forums, and governmental public reports. Websites and platforms with publicly available Swahili content were systematically crawled and archived to capture a broad spectrum of linguistic expressions. Special attention was given to reputable sources to maintain the authenticity and linguistic richness of the corpus. The inclusion of diverse sources ensures that the corpus reflects the linguistic nuances inherent in different contexts and registers within the Swahili language. Additionally, efforts were made to incorporate variations in domain dialects, acknowledging the linguistic diversity present in Swahili.
The potential for reusing this Swahili corpus is vast. Researchers, linguists, and language enthusiasts can leverage the diverse and extensive dataset for a multitude of applications, including NLP tasks such as sentiment analysis, textual data clustering, classification tasks, and machine translation. The corpus can serve as training data for developing and evaluating NLP algorithms, including part-of-speech tagging and named entity recognition. Text mining techniques can also be applied to the corpus, enabling researchers to extract valuable insights, identify patterns, and discover knowledge from large textual datasets. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/)
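The abstract reports sentence, word, and unique-word counts for the corpus but does not say how they were computed. The following is a minimal sketch of one way such figures could be derived, assuming one sentence per line, whitespace tokenization, and case-insensitive matching. The function name `corpus_stats` and the sample sentences are hypothetical and are not taken from the article.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (sentences, words, unique_words) for an iterable of text lines.

    Assumptions (not from the article): each non-blank line is one sentence,
    tokens are separated by whitespace, and words are compared in lowercase.
    """
    sentences = 0
    vocab = Counter()
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        sentences += 1
        vocab.update(tokens)
    # Total words is the sum of all token frequencies; unique words is the vocabulary size.
    return sentences, sum(vocab.values()), len(vocab)


# Hypothetical sample; a real run would iterate over the lines of the corpus text file.
sample = [
    "Habari za asubuhi",
    "Habari za jioni",
    "",
    "Elimu ni ufunguo wa maisha",
]
stats = corpus_stats(sample)
print(stats)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a 168 MB text file would be processed line by line rather than loaded into memory at once. Counts produced this way may differ from the published figures if the authors used a different tokenizer or sentence splitter.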
Pages: 9