Speculative Distributed CSV Data Parsing for Big Data Analytics

Cited by: 12
Authors
Ge, Chang [1, 2]
Li, Yinan [1 ]
Eilebrecht, Eric [1 ]
Chandramouli, Badrish [1 ]
Kossmann, Donald [1 ]
Affiliations
[1] Microsoft Res, Redmond, WA USA
[2] Univ Waterloo, Waterloo, ON, Canada
Keywords
RAW DATA;
DOI
10.1145/3299869.3319898
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
There has been a recent flurry of interest in providing query capability on raw data in today's big data systems. These raw data must be parsed before processing or use in analytics. Thus, a fundamental challenge in distributed big data systems is that of efficient parallel parsing of raw data. The difficulties come from the inherent ambiguity while independently parsing chunks of raw data without knowing the context of these chunks. Specifically, it can be difficult to find the beginnings and ends of fields and records in these chunks of raw data. To parallelize parsing, this paper proposes a speculation-based approach for the CSV format, arguably the most commonly used raw data format. Due to the syntactic and statistical properties of the format, speculative parsing rarely fails and therefore parsing is efficiently parallelized in a distributed setting. Our speculative approach is also robust, meaning that it can reliably detect syntax errors in CSV data. We experimentally evaluate the speculative, distributed parsing approach in Apache Spark using more than 11,000 real-world datasets, and show that our parser produces significant performance benefits over existing methods.
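The ambiguity described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes RFC 4180-style quoting, and the speculation rule (each chunk guesses that it begins outside a quoted field) and the names `record_breaks` and `speculative_split` are hypothetical, chosen only for illustration. Phase 1 is written as a plain loop here, although each chunk's work is independent and could run in parallel.

```python
import csv
import io


def record_breaks(chunk, in_quotes):
    """Return offsets just past each record-ending newline, plus the final quote state."""
    breaks = []
    for i, ch in enumerate(chunk):
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state stays correct.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            breaks.append(i + 1)
    return breaks, in_quotes


def speculative_split(data, chunk_size):
    """Split CSV text into records; also report how many chunks were re-scanned."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Phase 1 (independent per chunk): speculate that the chunk starts outside quotes.
    guesses = [record_breaks(c, False) for c in chunks]

    # Phase 2 (sequential, cheap): validate each guess against the true quote state.
    breaks, in_quotes, redone = [0], False, 0
    for idx, chunk in enumerate(chunks):
        if in_quotes:
            # Speculation failed: the chunk really began inside a quoted field.
            local, end_state = record_breaks(chunk, True)
            redone += 1
        else:
            local, end_state = guesses[idx]
        breaks.extend(idx * chunk_size + b for b in local)
        in_quotes = end_state

    ends = breaks[1:] + [len(data)]
    records = [data[a:b] for a, b in zip(breaks, ends) if a < b]
    return records, redone


def parse_records(records):
    """Parse each record separately with the standard csv module."""
    return [next(csv.reader(io.StringIO(r, newline=""))) for r in records]


if __name__ == "__main__":
    sample = 'id,note\n1,"hello\nworld"\n2,"say ""hi"""\n3,plain\n'
    recs, redone = speculative_split(sample, 12)
    print(parse_records(recs), "chunks re-scanned:", redone)
```

With a chunk size of 12, the first chunk of the sample ends inside the quoted field `"hello\nworld"`, so the following chunk's guess is wrong and it is re-scanned. On data where quoted newlines are rare, most guesses would hold and little work would be repeated.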
Pages: 883-899
Number of pages: 17