Adding data provenance support to Apache Spark

Cited by: 18
Authors
Interlandi, Matteo [1]
Ekmekji, Ari [3]
Shah, Kshitij [2]
Gulzar, Muhammad Ali [2]
Tetali, Sai Deep [2]
Kim, Miryung [2]
Millstein, Todd [2]
Condie, Tyson [2]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] Univ Calif Los Angeles, Los Angeles, CA USA
[3] Stanford Univ, Stanford, CA 94305 USA
Source
VLDB JOURNAL | 2018 / Volume 27 / Issue 05
Keywords
Data provenance; Spark; Debugging; MODEL
DOI
10.1007/s00778-017-0474-5
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
Pages: 595 - 615
Page count: 21
Related papers
50 in total
  • [1] Adding data provenance support to Apache Spark
    Matteo Interlandi
    Ari Ekmekji
    Kshitij Shah
    Muhammad Ali Gulzar
    Sai Deep Tetali
    Miryung Kim
    Todd Millstein
    Tyson Condie
    [J]. The VLDB Journal, 2018, 27 : 595 - 615
  • [2] Titian: Data Provenance Support in Spark
    Interlandi, Matteo
    Shah, Kshitij
    Tetali, Sai Deep
    Gulzar, Muhammad Ali
    Yoo, Seunghyun
    Kim, Miryung
    Millstein, Todd
    Condie, Tyson
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2015, 9 (03): 216 - 227
  • [3] FITS Data Source for Apache Spark
    Peloton J.
    Arnault C.
    Plaszczynski S.
    [J]. Computing and Software for Big Science, 2018, 2 (1)
  • [4] Big data analytics on Apache Spark
    Salloum S.
    Dautov R.
    Chen X.
    Peng P.X.
    Huang J.Z.
    [J]. International Journal of Data Science and Analytics, 2016, 1 (3-4) : 145 - 164
  • [5] On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science
    Akil, Bilal
    Zhou, Ying
    Roehm, Uwe
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017: 303 - 310
  • [6] Big Spatial Data Processing With Apache Spark
    Shangguan, Boyi
    Yue, Peng
    Wu, Zhaoyan
    Jiang, Liangcun
    [J]. 2017 6TH INTERNATIONAL CONFERENCE ON AGRO-GEOINFORMATICS, 2017: 239 - 242
  • [7] Big Data Software Analytics with Apache Spark
    Gousios, Georgios
    [J]. PROCEEDINGS 2018 IEEE/ACM 40TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING - COMPANION (ICSE-COMPANION), 2018: 542 - 543
  • [8] Geospatial Data Management in Apache Spark: A Tutorial
    Yu, Jia
    Sarwat, Mohamed
    [J]. 2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019: 2060 - 2063
  • [9] CMS Analysis and Data Reduction with Apache Spark
    Gutsche, Oliver
    Canali, Luca
    Cremer, Illia
    Cremonesi, Matteo
    Elmer, Peter
    Fisk, Ian
    Girone, Maria
    Jayatilaka, Bo
    Kowalkowski, Jim
    Khristenko, Viktor
    Motesnitsalis, Evangelos
    Pivarski, Jim
    Sehrish, Saba
    Surdy, Kacper
    Svyatkovskiy, Alexey
    [J]. 18TH INTERNATIONAL WORKSHOP ON ADVANCED COMPUTING AND ANALYSIS TECHNIQUES IN PHYSICS RESEARCH (ACAT2017), 2018, 1085
  • [10] Apache Spark: A Big Data Processing Engine
    Shaikh, Eman
    Mohiuddin, Iman
    Alufaisan, Yasmeen
    Nahvi, Irum
    [J]. 2019 2ND IEEE MIDDLE EAST AND NORTH AFRICA COMMUNICATIONS CONFERENCE (IEEEMENACOMM'19), 2019: 220 - 225