A machine-learning approach to automatic detection of delimiters in tabular data files

Cited by: 0
Authors
Saurav, Shitesh [1]
Schwarz, Peter [2]
Affiliations
[1] Univ Southern Calif, Viterbi Sch Engn, Los Angeles, CA 90007 USA
[2] IBM Res Almaden, San Jose, CA USA
Keywords
data ingestion; delimiters; logistic regression
DOI
10.1109/HPCC-SmartCity-DSS.2016.41
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Detection of string and column delimiters is a critical first step in the automated ingestion of files containing tabular data. In this paper we present an algorithm that uses a logistic-regression classifier to evaluate whether a particular choice of delimiters is correct. The delimiter choice that is given the highest score by the classifier is chosen as the one most likely to be correct. The algorithm makes the correct choice over 90% of the time on a test data set of files with a variety of different delimiters.
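The scoring idea in the abstract can be illustrated with a minimal sketch: parse the file under each candidate pair of column delimiter and string (quote) delimiter, summarize the resulting table as a few features, score the features with a logistic function, and keep the highest-scoring pair. Everything specific here is an assumption for illustration and is not taken from the paper: the candidate sets, the three features, and the hand-set weights (a real classifier would learn its weights from labeled files).

```python
import csv
import math
from collections import Counter
from itertools import product

# Hypothetical candidate sets; the abstract does not list the paper's candidates.
COLUMN_DELIMS = [",", "\t", ";", "|"]
QUOTE_CHARS = ['"', "'"]

# Placeholder weights chosen by hand; a trained logistic-regression
# model would learn these from files with known delimiters.
WEIGHTS = {"consistency": 4.0, "multi_column": 3.0, "log_columns": 0.5}
BIAS = -5.0


def features(lines, delim, quote):
    """Parse the lines under one delimiter choice and summarize the table."""
    rows = list(csv.reader(lines, delimiter=delim, quotechar=quote))
    counts = [len(r) for r in rows if r]  # field count of each non-empty row
    if not counts:
        return {"consistency": 0.0, "multi_column": 0.0, "log_columns": 0.0}
    modal_count, modal_freq = Counter(counts).most_common(1)[0]
    return {
        # Fraction of rows that share the most common field count.
        "consistency": modal_freq / len(counts),
        # A choice that yields a single column is rarely the right one.
        "multi_column": 1.0 if modal_count > 1 else 0.0,
        "log_columns": math.log(modal_count),
    }


def score(feats):
    """Logistic score in (0, 1) for one candidate delimiter choice."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def detect_delimiters(text):
    """Return the (column delimiter, quote character) pair with the top score."""
    lines = text.splitlines()
    candidates = product(COLUMN_DELIMS, QUOTE_CHARS)
    # On ties, max() keeps the first candidate in iteration order.
    return max(candidates, key=lambda c: score(features(lines, *c)))
```

For example, `detect_delimiters('name;city\n"Doe; Jane";Paris\n"Roe; Rick";Oslo\n')` selects the semicolon with the double quote, because only that pair yields the same number of fields in every row.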
Pages: 1501-1503
Page count: 3