Speculative Distributed CSV Data Parsing for Big Data Analytics

Cited by: 12
Authors
Ge, Chang [1, 2]
Li, Yinan [1]
Eilebrecht, Eric [1]
Chandramouli, Badrish [1]
Kossmann, Donald [1]
Affiliations
[1] Microsoft Res, Redmond, WA USA
[2] Univ Waterloo, Waterloo, ON, Canada
Keywords
RAW DATA
DOI
10.1145/3299869.3319898
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
There has been a recent flurry of interest in providing query capability on raw data in today's big data systems. These raw data must be parsed before processing or use in analytics. Thus, a fundamental challenge in distributed big data systems is that of efficient parallel parsing of raw data. The difficulties come from the inherent ambiguity while independently parsing chunks of raw data without knowing the context of these chunks. Specifically, it can be difficult to find the beginnings and ends of fields and records in these chunks of raw data. To parallelize parsing, this paper proposes a speculation-based approach for the CSV format, arguably the most commonly used raw data format. Due to the syntactic and statistical properties of the format, speculative parsing rarely fails and therefore parsing is efficiently parallelized in a distributed setting. Our speculative approach is also robust, meaning that it can reliably detect syntax errors in CSV data. We experimentally evaluate the speculative, distributed parsing approach in Apache Spark using more than 11,000 real-world datasets, and show that our parser produces significant performance benefits over existing methods.
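The ambiguity the abstract describes can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not detail. It only shows, under simplified RFC 4180-style quoting, that a chunk cut from the middle of a file yields different record boundaries depending on whether the parser assumes the chunk starts inside or outside a quoted field. The function name `record_breaks` and the sample `chunk` are hypothetical.

```python
from typing import List


def record_breaks(chunk: str, starts_in_quote: bool) -> List[int]:
    """Return offsets of newlines that end a record, given an assumed
    quoting state at the start of the chunk (hypothetical helper)."""
    in_quote = starts_in_quote
    breaks = []
    for i, ch in enumerate(chunk):
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is preserved.
            in_quote = not in_quote
        elif ch == "\n" and not in_quote:
            breaks.append(i)
    return breaks


# A chunk that really begins inside the quoted field "line one\nline two".
chunk = ' one\nline two"\n2,plain\n'

print(record_breaks(chunk, starts_in_quote=True))   # correct assumption
print(record_breaks(chunk, starts_in_quote=False))  # wrong assumption
```

A worker that sees only `chunk` cannot tell which of the two results is right. A speculative parser has to pick one assumption and be able to detect and recover when it turns out to be wrong.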
Pages: 883-899
Number of pages: 17