TextCL: A Python package for NLP preprocessing tasks

Cited by: 2
Authors
Petukhova, Alina [1]
Fachada, Nuno [1]
Affiliations
[1] Lusofona Univ, COPELABS, Campo Grande 376, Lisbon, Portugal
Keywords
Natural language processing; Text filtering; Outlier detection;
DOI
10.1016/j.softx.2022.101122
Chinese Library Classification number
TP31 [Computer software];
Subject classification codes
081202 ; 0835 ;
Abstract
Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as web scraping, scanned documents, or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. The package also offers an outlier detection module, which identifies and filters out texts that differ from the main topic distribution of the data set. This module lets the user select one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and apply it to the text data. © 2022 The Author(s). Published by Elsevier B.V.
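The kind of pipeline the abstract describes can be illustrated with a minimal sketch. This is not TextCL's API: the function names `split_sentences`, `drop_duplicates`, and `svd_outlier_scores` are hypothetical, the sentence splitter is a naive regular expression, and the outlier score is a plain low-rank SVD reconstruction error chosen here as a simple stand-in for the SVD option the abstract mentions. Language and perplexity filtering are omitted because they require external models.

```python
import re
import numpy as np


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def drop_duplicates(sentences):
    """Keep the first occurrence of each sentence, ignoring case and spacing."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


def svd_outlier_scores(texts, rank=1):
    """Return one score per text: the reconstruction error of its
    L2-normalized bag-of-words row under a rank-`rank` truncated SVD.
    Higher scores suggest the text sits far from the dominant topic."""
    tokenized = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-count matrix: one row per text, one column per vocabulary word.
    X = np.zeros((len(texts), len(vocab)))
    for row, tokens in enumerate(tokenized):
        for w in tokens:
            X[row, index[w]] += 1.0

    # Normalize rows so that text length does not dominate the score.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)

    # Keep only the top `rank` singular directions and measure what is lost.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return np.linalg.norm(X - X_low, axis=1)


texts = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a cat sat on a mat",
    "the cat chased the fish",
    "stock markets fell sharply today",
]
scores = svd_outlier_scores(texts, rank=1)
outlier_index = int(np.argmax(scores))  # index of the least typical text
```

In this toy data set the last text shares no vocabulary with the others, so the dominant singular direction does not represent it and it receives the largest score. A real data set would need a threshold or a ranking cutoff chosen for the task.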
Pages: 6
Related papers
50 items in total
  • [21] PYCHEM: a multivariate analysis package for python
    Jarvis, Roger M.
    Broadhurst, David
    Johnson, Helen
    O'Boyle, Noel M.
    Goodacre, Royston
    [J]. BIOINFORMATICS, 2006, 22 (20) : 2565 - 2566
  • [22] matplotlib - A portable python plotting package
    Barrett, P
    Hunter, J
    Miller, JT
    Hsu, JC
    Greenfield, P
    [J]. Astronomical Data Analysis Software and Systems XIV, Proceedings, 2005, 347 : 91 - 95
  • [23] PsychRNN: An Accessible and Flexible Python Package for Training Recurrent Neural Network Models on Cognitive Tasks
    Ehrlich, Daniel B.
    Stone, Jasmine T.
    Brandfonbrener, David
    Atanasov, Alexander
    Murray, John D.
    [J]. ENEURO, 2021, 8 (01) : 1 - 11
  • [24] GDPS: an open-source python-based software package for multi-GNSS data preprocessing
    Lu, Liguo
    Hu, Weijian
    Wu, Tangting
    [J]. GPS SOLUTIONS, 2024, 28 (03)
  • [25] Python tools for structural tasks in chemistry
    Ryzhkov, Fedor V.
    Ryzhkova, Yuliya E.
    Elinson, Michail N.
    [J]. MOLECULAR DIVERSITY, 2024,
  • [26] dingo: a Python package for metabolic flux sampling
    Chalkis, Apostolos
    Fisikopoulos, Vissarion
    Tsigaridas, Elias
    Zafeiropoulos, Haris
    [J]. BIOINFORMATICS ADVANCES, 2024, 4 (01):
  • [27] A Python upgrade to the GooFit package for parallel fitting
    Schreiner, Henry
    Pandey, Himadri
    Sokoloff, Michael D.
    Hittle, Bradley
    Tomko, Karen
    Hasse, Christoph
    [J]. 23RD INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY AND NUCLEAR PHYSICS (CHEP 2018), 2019, 214
  • [28] pyjeo: A Python Package for the Analysis of Geospatial Data
    Kempeneers, Pieter
    Pesek, Ondrej
    De Marchi, Davide
    Soille, Pierre
    [J]. ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION, 2019, 8 (10)
  • [29] pymetamodels: A Python package for metamodeling and design automation
    Escribano, Nicolas
    Bielsa, Jose Manuel
    Lahuerta, Francisco
    [J]. SOFTWAREX, 2024, 26
  • [30] ADOpy: a python package for adaptive design optimization
    Yang, Jaeyeong
    Pitt, Mark A.
    Ahn, Woo-Young
    Myung, Jay I.
    [J]. BEHAVIOR RESEARCH METHODS, 2021, 53 (02) : 874 - 897