Adding data provenance support to Apache Spark

Cited by: 18

Authors:

Interlandi, Matteo [1]

Ekmekji, Ari [3]

Shah, Kshitij [2]

Gulzar, Muhammad Ali [2]

Tetali, Sai Deep [2]

Kim, Miryung [2]

Millstein, Todd [2]

Condie, Tyson [2]

Affiliations:

[1] Microsoft, Redmond, WA 98052 USA

[2] Univ Calif Los Angeles, Los Angeles, CA USA

[3] Stanford Univ, Stanford, CA 94305 USA

Source:

VLDB JOURNAL | 2018 / Vol. 27 / Issue 05

Keywords:

Data provenance; Spark; Debugging; MODEL;

DOI:

10.1007/s00778-017-0474-5

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

Citation

Pages: 595-615

Page count: 21
