An Extensible Parsing Pipeline for Unstructured Data Processing

Times cited: 0
Authors
Jain, Shubham [1]
de Buitleir, Amy [2]
Fallon, Enda [1]
Affiliations
[1] Athlone Inst Technol, Software Res Inst, Athlone, Ireland
[2] Ericsson, Network Management Lab, Athlone, Ireland
Keywords
Unsupervised Data Mining; Information Extraction; Clustering; Topic Modeling
DOI
Not available
Chinese Library Classification code
TN [Electronic technology, communication technology]
Discipline classification code
0809
Abstract
Network monitoring and diagnostics systems depict the state of a running system and generate enormous amounts of unstructured data through log files, print statements, and other reports. Manually analyzing all of these files is not feasible: resources are limited, and custom parsers must be developed to convert the unstructured data into the desired file formats. Prior research focuses on rule-based and relationship-based methods for parsing unstructured data into structured file formats; these methods are labor-intensive and need large annotated datasets. This paper presents an unsupervised text processing pipeline that analyzes such text files, removes extraneous information, identifies tabular components, and parses them into a structured file format. The proposed approach is resilient to changes in the data structure, does not require training data, and is domain-independent. We experiment with topic modeling and clustering approaches and compare them to verify the accuracy of the proposed technique. Our findings indicate that combining similarity and clustering algorithms to identify data components achieved better accuracy than topic modeling.
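The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of combining a similarity measure with clustering to pick out tabular lines from a noisy report. It is not the authors' pipeline. Every name here (`shape`, `cluster_lines`, `extract_table`), the token-shape encoding, the greedy clustering strategy, the 0.8 threshold, and the sample report are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: a token counts as numeric if it looks like an optionally signed number.
NUMERIC = re.compile(r"[-+]?\d+(\.\d+)?%?")


def shape(line):
    """Reduce a line to a coarse shape string: N per numeric token, W otherwise."""
    return "".join("N" if NUMERIC.fullmatch(tok) else "W" for tok in line.split())


def cluster_lines(lines, threshold=0.8):
    """Greedy one-pass clustering of lines by the similarity of their shapes."""
    clusters = []  # each entry: (representative shape, list of line indices)
    for index, line in enumerate(lines):
        current = shape(line)
        if not current:
            continue  # skip blank lines
        for representative, members in clusters:
            if SequenceMatcher(None, representative, current).ratio() >= threshold:
                members.append(index)
                break
        else:
            clusters.append((current, [index]))
    return clusters


def extract_table(lines, threshold=0.8, min_rows=3):
    """Treat the largest cluster of similar lines as the table; drop the rest as noise."""
    clusters = cluster_lines(lines, threshold)
    if not clusters:
        return []
    _, members = max(clusters, key=lambda cluster: len(cluster[1]))
    if len(members) < min_rows:
        return []
    return [lines[i].split() for i in members]


# Hypothetical report text: a banner, three table rows, and a trailer line.
sample = [
    "*** diagnostics report generated ***",
    "",
    "eth0 1500 10234 0",
    "eth1 1500 88 2",
    "lo 65536 412 0",
    "end of report",
]
rows = extract_table(sample)
```

In this sketch the banner and trailer lines fall into their own small clusters and are discarded, while the three rows that share a shape are returned as lists of fields ready to be written to a structured format such as CSV.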
Pages: 312-318
Page count: 7
Related papers
50 in total
  • [1] An Extensible Parsing Pipeline for Unstructured Data Processing
    Jain, Shubham
    de Buitleir, Amy
    Fallon, Enda
    [J]. 2022 24TH INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY (ICACT): ARTIFICIAL INTELLIGENCE TECHNOLOGIES TOWARD CYBERSECURITY, 2022, : 312 - 318
  • [2] Knowledge Graph Generation for Unstructured Data Using Data Processing Pipeline
    Sukumar, Sushmi Thushara
    Lung, Chung-Horng
    Zaman, Marzia
    [J]. 2023 IEEE 47TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE, COMPSAC, 2023, : 466 - 471
  • [3] A Review of Unstructured Data Analysis and Parsing Methods
    Jain, Shubham
    de Buitleir, Amy
    Fallon, Enda
    [J]. 2020 INTERNATIONAL CONFERENCE ON EMERGING SMART COMPUTING AND INFORMATICS (ESCI), 2020, : 164 - 169
  • [4] Unsupervised Noise Detection in Unstructured data for Automatic Parsing
    Jain, Shubham
    de Buitleir, Amy
    Fallon, Enda
    [J]. 2020 16TH INTERNATIONAL CONFERENCE ON NETWORK AND SERVICE MANAGEMENT (CNSM), 2020,
  • [5] Extensible Query Framework for Unstructured Medical Data - A Big Data Approach
    Istephan, Sarmad
    Siadat, Mohammad-Reza
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOP (ICDMW), 2015, : 455 - 462
  • [6] A Framework for Adaptive Deep Reinforcement Semantic Parsing of Unstructured Data
    Jain, Shubham
    de Buitleir, Amy
    Fallon, Enda
    [J]. 12TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE (ICTC 2021): BEYOND THE PANDEMIC ERA WITH ICT CONVERGENCE INNOVATION, 2021, : 1055 - 1060
  • [7] Analysis and Parsing of Unstructured Cyber-Security Incident Data
    Ochoa, Armando J.
    Finlayson, Mark A.
    [J]. PROCEEDINGS OF THE 2019 CONFERENCE ON SECURITY AND PRIVACY IN WIRELESS AND MOBILE NETWORKS (WISEC '19), 2019, : 345 - 346
  • [8] Skluma: An extensible metadata extraction pipeline for disorganized data
    Skluzacek, Tyler J.
    Kumar, Rohan
    Chard, Ryan
    Harrison, Galen
    Beckman, Paul
    Chard, Kyle
    Foster, Ian T.
    [J]. 2018 IEEE 14TH INTERNATIONAL CONFERENCE ON E-SCIENCE (E-SCIENCE 2018), 2018, : 256 - 266
  • [9] Processing of Unstructured data for Information Extraction
    Ingle, Vaishali A.
    [J]. 3RD NIRMA UNIVERSITY INTERNATIONAL CONFERENCE ON ENGINEERING (NUICONE 2012), 2012,
  • [10] A benchmark suite for unstructured data processing
    Smullen, Clinton Wills
    Tarapore, Shahrukh Rohinton
    Gurumurthi, Sudhanva
    [J]. SNAPI 2007: FOURTH INTERNATIONAL WORKSHOP ON STORAGE NETWORK ARCHITECTURE AND PARALLEL I/OS, PROCEEDINGS, 2007, : 79 - 83