ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks

Cited by: 1

Authors:

Henning, Soeren [1]

Vogel, Adriano [1]

Leichtfried, Michael [2]

Ertl, Otmar [2]

Rabiser, Rick [3]

Affiliations:

[1] Johannes Kepler Univ Linz, JKU Dynatrace Coinnovat Lab, Linz, Austria

[2] Dynatrace LLC, Dynatrace Res, Linz, Austria

[3] Johannes Kepler Univ Linz, LIT CPS Lab, Linz, Austria

Source:

PROCEEDINGS OF THE 15TH ACM/SPEC INTERNATIONAL CONFERENCE ON PERFORMANCE ENGINEERING, ICPE 2024 | 2024

Keywords:

benchmarking; data shuffling; performance; stream processing; latency

DOI:

10.1145/3629526.3645036

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Distributed stream processing frameworks help build scalable and reliable applications that perform transformations and aggregations on continuous data streams. This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks. In contrast to other benchmarks, it focuses on use cases where stream processing frameworks are mainly employed for shuffling (i.e., re-distributing) data records to perform state-local aggregations, while the actual aggregation logic is treated as a black-box software component. ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform and adopts benchmarking metrics and methods for latency, throughput, and scalability established in the performance engineering research community. Although inspired by a real-world observability use case, it is highly configurable to allow domain-independent evaluations. ShuffleBench is provided as ready-to-use open-source software that utilizes existing Kubernetes tooling and provides implementations for four state-of-the-art frameworks. Therefore, we expect ShuffleBench to be a valuable contribution to both industrial practitioners building stream processing applications and researchers working on new stream processing approaches. We complement this paper with an experimental performance evaluation that employs ShuffleBench with various configurations on Flink, Hazelcast, Kafka Streams, and Spark in a cloud-native environment. Our results show that Flink achieves the highest throughput, while Hazelcast processes data streams with the lowest latency.

Pages: 2-13

Number of pages: 12
