Natural Language or Not (NLoN) - A Package for Software Engineering Text Analysis Pipeline

Cited by: 16
Authors
Mantyla, Mika V. [1]
Calefato, Fabio [2]
Claes, Maelick [1]
Affiliations
[1] Univ Oulu, M3S, Oulu, Finland
[2] Univ Bari, Dipartimento Jon, Bari, Italy
Keywords
natural language processing; preprocessing; filtering; machine learning; regular expressions; character n-grams; glmnet; lasso; logistic regression
DOI
10.1145/3196398.3196444
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The use of natural language processing (NLP) is gaining popularity in software engineering. To perform NLP correctly, we must pre-process the textual information to separate natural language from other content, such as log messages, that is often part of software engineering communication. We present a simple approach for classifying whether a textual input is natural language or not. Although our NLoN package relies on only 11 language features and character tri-grams, it achieves an area under the ROC curve (AUC) between 0.976 and 0.987 on three different data sources, with Lasso regression from Glmnet as the learner and two human raters providing the ground truth. Cross-source prediction performance is lower and fluctuates more, with top AUC values ranging from 0.913 to 0.980. Compared with prior work, our approach offers similar performance but is considerably more lightweight, making it easier to apply in software engineering text mining pipelines. Our source code and data are provided as an R package for further improvement.
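As a rough illustration of the kind of input such a classifier might consume, the following Python sketch extracts character tri-gram counts and a handful of simple line-level features. It is not the NLoN implementation (which is an R package): the function names `char_trigrams` and `simple_features`, and the five features shown, are hypothetical stand-ins and are not the 11 features used by the authors. The learner itself (Lasso-regularized logistic regression) is omitted.

```python
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Count overlapping character tri-grams in a line of text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def simple_features(text: str) -> dict:
    """Illustrative hand-crafted features (hypothetical, not the paper's 11)."""
    n = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "caps_ratio": sum(c.isupper() for c in text) / n,
        "special_ratio": sum(
            not c.isalnum() and not c.isspace() for c in text
        ) / n,
        "digit_ratio": sum(c.isdigit() for c in text) / n,
        "word_count": len(text.split()),
        "ends_with_period": text.rstrip().endswith("."),
    }


# Example inputs: one prose sentence and one stack-trace-like line.
prose = "Please restart the server and try again."
log_line = "at org.foo.Bar.run(Bar.java:42)"

print(char_trigrams("hello").most_common())
print(simple_features(prose))
print(simple_features(log_line))
```

In a full pipeline, feature vectors of this kind would be fed to an L1-regularized logistic regression, so that only the most informative tri-grams and features keep non-zero weights.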
Pages: 387-391
Page count: 5