TextCL: A Python package for NLP preprocessing tasks

Cited by: 2
Authors
Petukhova, Alina [1]
Fachada, Nuno [1]
Affiliations
[1] Lusofona Univ, COPELABS, Campo Grande 376, Lisbon, Portugal
Keywords
Natural language processing; Text filtering; Outlier detection
DOI
10.1016/j.softx.2022.101122
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as web scraping, scanned documents, or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. The package also offers an outlier detection module, which identifies and filters out texts that differ from the main topic distribution of the data set. This module lets the user select one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and apply it to the text data. © 2022 The Author(s). Published by Elsevier B.V.
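To make two of the preprocessing steps named in the abstract concrete (sentence splitting and duplicate-sentence removal), here is a minimal sketch in plain Python. It does not use TextCL and does not reproduce its API: the function names `split_into_sentences` and `remove_duplicate_sentences`, the regex-based splitting rule, and the case-insensitive comparison are all assumptions made for illustration only.

```python
import re

# Hypothetical helpers for illustration; these are NOT TextCL functions.
# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def remove_duplicate_sentences(sentences):
    """Keep the first occurrence of each sentence.

    Assumption: two sentences count as duplicates when they match after
    lowercasing and collapsing runs of whitespace.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


raw = "Scraped text is noisy. Scraped  text is noisy. It needs cleaning!"
cleaned = remove_duplicate_sentences(split_into_sentences(raw))
print(cleaned)
```

A regex splitter like this will mis-split on abbreviations such as "e.g.", so it only illustrates the shape of the step; the language, perplexity, and outlier filters described in the abstract are not covered here.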
Pages: 6