SparkBench: a Spark benchmarking suite characterizing large-scale in-memory data analytics

Cited by: 0

Authors:

Min Li

Jian Tan

Yandong Wang

Li Zhang

Valentina Salapura

Affiliations:

[1] IBM Almaden Research Center

[2] Ohio State University

Source:

Cluster Computing | 2017 / Vol. 20

Keywords:

Benchmark; Spark; Workload characterization; Big data analytics

DOI:

Not available

Abstract:

Spark has recently been increasingly adopted in industry for big data analytics, owing to its resilience, scalability, and efficient in-memory distributed programming model. Meanwhile, its rapidly growing community is actively incubating a rich ecosystem around Spark to tackle various big data challenges. Current benchmarks fall short in guiding the development, optimization, configuration, and deployment of Spark. To this end, we introduce SparkBench, a Spark-specific benchmarking suite. It includes a selected set of representative applications to identify various performance bottlenecks and reveal resource consumption behaviors across execution phases. Overall, SparkBench covers four critical usage patterns of Spark: machine learning, graph processing, stream computation, and SQL query processing. We present a comprehensive characterization of resource consumption, data flows, and timing information under different execution patterns, and demonstrate that SparkBench can effectively guide the optimization of data analytics platforms to better suit various workloads.

Pages: 2575-2589

Page count: 14

Citation: Li, Min; Tan, Jian; Wang, Yandong; Zhang, Li; Salapura, Valentina. SparkBench: a Spark benchmarking suite characterizing large-scale in-memory data analytics. Cluster Computing, 2017, 20(3): 2575-2589.