A Methodological Framework for Creating Large-Scale Corpus for Natural Language Processing Models

Cited by: 0
Authors
Santos, David [1, 4]
Auquilla, Andres [2, 4]
Siguenza-Guzman, Lorena [2, 3, 4]
Pena, Mario [3, 4]
Affiliations
[1] Univ Cuenca, Fac Engn, Cuenca 010107, Ecuador
[2] Univ Cuenca, Dept Comp Sci, Fac Engn, Cuenca 010107, Ecuador
[3] Katholieke Univ Leuven, Res Ctr Accountancy, Fac Econ & Business, Leuven, Belgium
[4] Univ Cuenca, Res Dept DIUC, Cuenca 010107, Ecuador
Keywords
Corpus construction; Corpus in Spanish; Large-scale corpus; Methodological framework; Supplies for NLP
DOI
10.1007/978-3-030-89941-7_7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Machine Learning models are currently being introduced into many aspects of everyday life. One relevant field is Natural Language Processing (NLP), which seeks to model human language. Data is a key and basic component for these models to learn properly. This article proposes a methodological framework for constructing a large-scale corpus to feed NLP models. The framework was developed in response to the difficulty of finding inputs in languages other than English. With a focus on producing a high-quality resource, the construction phases were designed together with the considerations that must be taken into account. The stages are: characterization of the corpus to be obtained, document collection, cleaning, translation, storage, and evaluation. The proposed approach uses automatic translators to take advantage of the vast amount of English literature and is implemented with no-cost libraries. Finally, a case study was developed, resulting in a Spanish corpus of more than 170,000 documents within a specific domain, namely opinions on textile products. The evaluations carried out indicate that the proposed framework can build a large-scale, high-quality corpus.
Pages: 87-100
Page count: 14
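
The stages named in the abstract (collection, cleaning, translation, storage, evaluation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`clean`, `build_corpus`, `save_jsonl`, `evaluate`), the `min_chars` threshold, the JSON Lines storage format, and the statistics reported are all assumptions made for illustration, and `translate` is a placeholder callable standing in for whatever automatic translator is actually used.

```python
import html
import json
import re
from typing import Callable, Dict, Iterable, List


def clean(text: str) -> str:
    """Cleaning stage: unescape HTML entities, strip tags, collapse whitespace."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(
    raw_docs: Iterable[str],
    translate: Callable[[str], str],  # hypothetical translator hook
    min_chars: int = 10,              # assumed minimum document length
) -> List[Dict[str, str]]:
    """Clean, filter, de-duplicate, and translate the collected documents."""
    seen = set()
    corpus = []
    for raw in raw_docs:
        cleaned = clean(raw)
        key = cleaned.lower()
        # Skip documents that are too short or already seen.
        if len(cleaned) < min_chars or key in seen:
            continue
        seen.add(key)
        corpus.append({"source": cleaned, "target": translate(cleaned)})
    return corpus


def save_jsonl(corpus: List[Dict[str, str]], path: str) -> None:
    """Storage stage: write one JSON record per line (assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in corpus:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def evaluate(corpus: List[Dict[str, str]]) -> Dict[str, float]:
    """Evaluation stage: simple size statistics on the translated side."""
    n = len(corpus)
    tokens = sum(len(r["target"].split()) for r in corpus)
    return {"documents": n, "avg_tokens": tokens / n if n else 0.0}
```

In use, `build_corpus(reviews, translate=my_translator)` would be followed by `save_jsonl(corpus, "corpus.jsonl")` and `evaluate(corpus)`, where `reviews` and `my_translator` are supplied by the caller. A real evaluation of translation quality would need more than these size statistics.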