Speculative Distributed CSV Data Parsing for Big Data Analytics

Cited by: 12
Authors
Ge, Chang [1, 2]
Li, Yinan [1]
Eilebrecht, Eric [1]
Chandramouli, Badrish [1]
Kossmann, Donald [1]
Affiliations
[1] Microsoft Res, Redmond, WA USA
[2] Univ Waterloo, Waterloo, ON, Canada
Keywords
RAW DATA;
DOI
10.1145/3299869.3319898
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
There has been a recent flurry of interest in providing query capability on raw data in today's big data systems. These raw data must be parsed before processing or use in analytics. Thus, a fundamental challenge in distributed big data systems is that of efficient parallel parsing of raw data. The difficulties come from the inherent ambiguity while independently parsing chunks of raw data without knowing the context of these chunks. Specifically, it can be difficult to find the beginnings and ends of fields and records in these chunks of raw data. To parallelize parsing, this paper proposes a speculation-based approach for the CSV format, arguably the most commonly used raw data format. Due to the syntactic and statistical properties of the format, speculative parsing rarely fails and therefore parsing is efficiently parallelized in a distributed setting. Our speculative approach is also robust, meaning that it can reliably detect syntax errors in CSV data. We experimentally evaluate the speculative, distributed parsing approach in Apache Spark using more than 11,000 real-world datasets, and show that our parser produces significant performance benefits over existing methods.
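The chunk-boundary ambiguity described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a simplified illustration that assumes RFC 4180-style quoting (double quotes delimit fields and `""` escapes a quote) and `\n` record terminators. The function names `scan_chunk` and `speculative_record_ends` are hypothetical. Each chunk is scanned independently under the guess that it starts outside a quoted field; a cheap sequential pass then checks each guess and rescans only the chunks where it was wrong.

```python
from typing import List, Tuple


def scan_chunk(chunk: str, in_quotes: bool) -> Tuple[List[int], bool]:
    """Scan one chunk given its starting quote state.

    Returns the offsets of newlines that end a record (newlines outside
    quoted fields) and the quote state at the end of the chunk.
    """
    ends = []
    for i, ch in enumerate(chunk):
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            ends.append(i)
    return ends, in_quotes


def speculative_record_ends(text: str, chunk_size: int) -> Tuple[List[int], int]:
    """Find record-ending newlines by speculating on each chunk's start state.

    Returns the global offsets of record ends and the number of chunks
    whose speculation failed and had to be rescanned.
    """
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Phase 1 (independent per chunk, so it could run in parallel):
    # guess that every chunk starts outside a quoted field.
    guesses = [scan_chunk(c, False) for c in chunks]

    # Phase 2 (sequential): validate each guess against the true state
    # carried over from the previous chunk.
    ends: List[int] = []
    misses = 0
    state = False
    offset = 0
    for chunk, (spec_ends, spec_final) in zip(chunks, guesses):
        if state:
            # Speculation failed: the chunk actually starts inside quotes.
            misses += 1
            spec_ends, spec_final = scan_chunk(chunk, True)
        ends.extend(offset + e for e in spec_ends)
        state = spec_final
        offset += len(chunk)
    return ends, misses


# Sample input with a newline and escaped quotes inside quoted fields.
text = 'id,note\n1,"hello\nworld"\n2,"a ""b"" c"\n3,plain\n'
record_ends, failed = speculative_record_ends(text, chunk_size=12)
```

In this sketch the first 12-character chunk ends inside the quoted field `"hello\nworld"`, so the guess for the second chunk fails and that chunk is rescanned. On data where quoted fields rarely span chunk boundaries, few chunks would need a rescan.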
Pages: 883-899
Page count: 17