Adding data provenance support to Apache Spark

Cited by: 18

Authors:

Interlandi, Matteo [1]

Ekmekji, Ari [3]

Shah, Kshitij [2]

Gulzar, Muhammad Ali [2]

Tetali, Sai Deep [2]

Kim, Miryung [2]

Millstein, Todd [2]

Condie, Tyson [2]

Affiliations:

[1] Microsoft, Redmond, WA 98052 USA

[2] Univ Calif Los Angeles, Los Angeles, CA USA

[3] Stanford Univ, Stanford, CA 94305 USA

Source:

VLDB JOURNAL | 2018 / Vol. 27 / Issue 05

Keywords:

Data provenance; Spark; Debugging; MODEL;

DOI:

10.1007/s00778-017-0474-5

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

Pages: 595-615

Page count: 21
