Adding data provenance support to Apache Spark

Cited by: 18
Authors
Interlandi, Matteo [1]
Ekmekji, Ari [3]
Shah, Kshitij [2]
Gulzar, Muhammad Ali [2]
Tetali, Sai Deep [2]
Kim, Miryung [2]
Millstein, Todd [2]
Condie, Tyson [2]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] Univ Calif Los Angeles, Los Angeles, CA USA
[3] Stanford Univ, Stanford, CA 94305 USA
Source
VLDB JOURNAL | 2018 / Vol. 27 / No. 5
Keywords
Data provenance; Spark; Debugging; MODEL
DOI
10.1007/s00778-017-0474-5
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
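To make the idea of tracking data through transformations concrete, the following is a minimal plain-Python sketch, not Titian's API and not Spark code. It assumes a toy in-memory pipeline in which every record carries the set of input line numbers it derives from; all function names (`load`, `map_records`, `filter_records`, `reduce_by_key`, `trace_back`) and the sample log are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def load(lines):
    # Tag each input record with the set of input line numbers it derives from.
    return [(line, frozenset([i])) for i, line in enumerate(lines)]


def map_records(records, fn):
    # One-to-one transformation: lineage passes through unchanged.
    return [(fn(value), lineage) for value, lineage in records]


def filter_records(records, predicate):
    # Surviving records keep their lineage; dropped records vanish with theirs.
    return [(value, lineage) for value, lineage in records if predicate(value)]


def reduce_by_key(records, combine):
    # Many-to-one transformation: the lineage of each output is the union
    # of the lineages of every record that contributed to it.
    values, lineages = {}, defaultdict(frozenset)
    for (key, value), lineage in records:
        values[key] = combine(values[key], value) if key in values else value
        lineages[key] = lineages[key] | lineage
    return [((key, values[key]), lineages[key]) for key in values]


def trace_back(records, key, source):
    # Backward trace: return the original input lines behind one output key.
    for (k, _), lineage in records:
        if k == key:
            return [source[i] for i in sorted(lineage)]
    return []


# Hypothetical input: "<status> <amount>" lines, one of them an outlier.
log = ["ok 12", "error 7", "ok 3", "error -500", "error 9"]

parsed = map_records(load(log), lambda line: (line.split()[0], int(line.split()[1])))
errors = filter_records(parsed, lambda kv: kv[0] == "error")
totals = reduce_by_key(errors, lambda a, b: a + b)

# The suspicious aggregate can be traced to the exact input lines behind it.
culprits = trace_back(totals, "error", log)
```

Here the negative total for the `error` key can be traced back to the three input lines that produced it, which is the kind of root-cause lookup the abstract describes. A real system has to capture this lineage across distributed stages without the per-record overhead this toy version incurs.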
Pages: 595-615
Page count: 21