Natural Language or Not (NLoN) - A Package for Software Engineering Text Analysis Pipeline

Cited by: 16
Authors
Mantyla, Mika V. [1 ]
Calefato, Fabio [2 ]
Claes, Maelick [1 ]
Affiliations
[1] Univ Oulu, M3S, Oulu, Finland
[2] Univ Bari, Dipartimento Jon, Bari, Italy
Keywords
natural language processing; preprocessing; filtering; machine learning; regular expressions; character n-grams; glmnet; lasso; logistic regression
DOI
10.1145/3196398.3196444
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The use of natural language processing (NLP) is gaining popularity in software engineering. To perform NLP correctly, we must pre-process the textual information to separate natural language from other content, such as log messages, that is often part of communication in software engineering. We present a simple approach for classifying whether a given textual input is natural language or not. Although our NLoN package relies on only 11 language features and character tri-grams, it achieves area under the ROC curve (AUC) values between 0.976 and 0.987 on three different data sources, using Lasso regression from Glmnet as the learner and two human raters to provide ground truth. Cross-source prediction performance is lower and fluctuates more, with top AUC values ranging from 0.913 to 0.980. Compared with prior work, our approach offers similar performance but is considerably more lightweight, making it easier to apply in software engineering text mining pipelines. Our source code and data are provided as an R package for further improvement.
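As a rough illustration of the kind of input the abstract describes, the sketch below extracts character tri-gram counts and two line-level features from a piece of text. It is written in Python for brevity and is not the NLoN package's R API. The function names `char_trigrams` and `simple_features`, the two features chosen, and the sample strings are all assumptions made for this example; they are not the package's actual 11 features.

```python
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Count overlapping 3-character substrings of `text`."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def simple_features(text: str) -> dict:
    """Two illustrative features (hypothetical, not the package's own set)."""
    n = max(len(text), 1)  # avoid division by zero on empty input
    return {
        # share of upper-case characters
        "ratio_caps": sum(c.isupper() for c in text) / n,
        # share of characters that are neither alphanumeric nor whitespace
        "ratio_special": sum(not (c.isalnum() or c.isspace()) for c in text) / n,
    }


# Hypothetical sample inputs: one prose sentence, one stack-trace-like line.
prose = "This patch fixes the login bug."
log_line = "at org.foo.Bar.run(Bar.java:42)"

prose_features = simple_features(prose)
log_features = simple_features(log_line)
prose_trigrams = char_trigrams(prose)
```

Feature vectors of this kind could then be passed to an L1-regularized logistic regression, which is the role the abstract assigns to Lasso regression from Glmnet.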
Pages: 387-391
Number of pages: 5