A machine-learning approach to automatic detection of delimiters in tabular data files

Cited by: 0
Authors
Saurav, Shitesh [1]
Schwarz, Peter [2]
Affiliations
[1] Univ Southern Calif, Viterbi Sch Engn, Los Angeles, CA 90007 USA
[2] IBM Res Almaden, San Jose, CA USA
Keywords
data ingestion; delimiters; logistic regression
DOI
10.1109/HPCC-SmartCity-DSS.2016.41
Chinese Library Classification (CLC) number
TP [Automation and computer technology]
Discipline classification code
0812
Abstract
Detection of string and column delimiters is a critical first step in the automated ingestion of files containing tabular data. In this paper we present an algorithm that uses a logistic-regression classifier to evaluate whether a particular choice of delimiters is correct. The delimiter choice that is given the highest score by the classifier is chosen as the one most likely to be correct. The algorithm makes the correct choice over 90% of the time on a test data set of files with a variety of different delimiters.
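The score-and-select idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate sets, the two features (`consistency`, `multi_column`), and the hand-set values in `WEIGHTS` are all assumptions made for illustration, whereas the paper's classifier is trained on data. The sketch parses the file once per candidate pair of column delimiter and string delimiter, scores each parse with a logistic function, and returns the highest-scoring pair.

```python
import csv
import io
import math
from itertools import product

# Assumed candidate sets; the paper's actual candidates are not given here.
COLUMN_DELIMS = [",", ";", "\t", "|"]
QUOTE_CHARS = ['"', "'"]

# Hypothetical weights chosen by hand. A real logistic-regression
# classifier would learn them from files with known delimiters.
WEIGHTS = {"bias": -4.0, "consistency": 5.0, "multi_column": 3.0}


def features(text, delim, quote):
    """Parse text with one candidate pair and describe the resulting table."""
    reader = csv.reader(io.StringIO(text), delimiter=delim, quotechar=quote)
    widths = [len(row) for row in reader if row]
    if not widths:
        return {"consistency": 0.0, "multi_column": 0.0}
    most_common = max(set(widths), key=widths.count)
    return {
        # Fraction of rows that share the most common column count.
        "consistency": widths.count(most_common) / len(widths),
        # 1.0 if the typical row was split into more than one column.
        "multi_column": 1.0 if most_common > 1 else 0.0,
    }


def score(text, delim, quote):
    """Logistic score in (0, 1): how plausible this delimiter choice looks."""
    f = features(text, delim, quote)
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value for name, value in f.items())
    return 1.0 / (1.0 + math.exp(-z))


def detect(text):
    """Return the (column delimiter, string delimiter) pair with the highest score."""
    candidates = product(COLUMN_DELIMS, QUOTE_CHARS)
    return max(candidates, key=lambda pair: score(text, *pair))
```

For the three-line input `'name;note\n"a;b";1\n"c";2\n'`, `detect` returns `(';', '"')`: only that pair yields three rows of exactly two columns, because the double quote protects the semicolon inside `"a;b"`.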
Pages: 1501-1503
Number of pages: 3
Related papers
50 in total
  • [1] Automatic Machine Learning-Based OLAP Measure Detection for Tabular Data
    Yang, Yuzhao
    Abdelhedi, Fatma
    Darmont, Jerome
    Ravat, Franck
    Teste, Olivier
    BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY, DAWAK 2022, 2022, 13428 : 173 - 188
  • [2] Similarity detection among data files - A machine learning approach
    Dash, M
    Liu, H
    1997 IEEE KNOWLEDGE AND DATA ENGINEERING EXCHANGE WORKSHOP, PROCEEDINGS, 1997, : 172 - 179
  • [3] ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data
    Piccolo, Stephen R.
    Lee, Terry J.
    Suh, Erica
    Hill, Kimball
    GIGASCIENCE, 2020, 9 (04):
  • [4] Detection of packaged and encrypted PE files with the use of machine-learning algorithm
    Gevorgyan, R. A.
    Abramov, E. S.
    11TH INTERNATIONAL CONFERENCE ON SECURITY OF INFORMATION AND NETWORKS (SIN 2018), 2018,
  • [5] Automatic Detection of Large-scale Flux Ropes and Their Geoeffectiveness with a Machine-learning Approach
    Pal, Sanchita
    dos Santos, Luiz F. G.
    Weiss, Andreas J.
    Narock, Thomas
    Narock, Ayris
    Nieves-Chinchilla, Teresa
    Jian, Lan K.
    Good, Simon W.
    ASTROPHYSICAL JOURNAL, 2024, 972 (01):
  • [6] A Machine-Learning Approach for Automatic Grape-Bunch Detection Based on Opponent Colors
    Bruni, Vittoria
    Dominijanni, Giulia
    Vitulano, Domenico
    SUSTAINABILITY, 2023, 15 (05)
  • [7] A Machine-learning based Unbiased Phishing Detection Approach
    Shirazi, Hossein
    Zweigle, Landon
    Ray, Indrakshi
    PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON E-BUSINESS AND TELECOMMUNICATIONS (SECRYPT), VOL 1, 2020, : 423 - 430
  • [8] A Machine-Learning Approach for Detection and Quantification of QRS Fragmentation
    Goovaerts, Griet
    Padhy, Sibasankar
    Vandenberk, Bert
    Varon, Carolina
    Willems, Rik
    Van Huffel, Sabine
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2019, 23 (05) : 1980 - 1989
  • [9] Machine-Learning Approach to Analysis of Driving Simulation Data
    Yoshizawa, Akira
    Nishiyama, Hiroyuki
    Iwasaki, Hirotoshi
    Mizoguchi, Fumio
    2016 IEEE 15TH INTERNATIONAL CONFERENCE ON COGNITIVE INFORMATICS & COGNITIVE COMPUTING (ICCI*CC), 2016, : 398 - 402
  • [10] Road-Deterioration Detection using Road Vibration Data with Machine-Learning Approach
    Takanashi, Masaki
    Ishii, Yoshinao
    Sato, Shu-ichi
    Sano, Noriyoshi
    Sanda, Katsushi
    2020 IEEE INTERNATIONAL CONFERENCE ON PROGNOSTICS AND HEALTH MANAGEMENT (ICPHM), 2020,