ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks

Cited by: 1
Authors
Henning, Soeren [1]
Vogel, Adriano [1]
Leichtfried, Michael [2]
Ertl, Otmar [2]
Rabiser, Rick [3]
Affiliations
[1] Johannes Kepler Univ Linz, JKU Dynatrace Coinnovat Lab, Linz, Austria
[2] Dynatrace LLC, Dynatrace Res, Linz, Austria
[3] Johannes Kepler Univ Linz, LIT CPS Lab, Linz, Austria
Keywords
benchmarking; data shuffling; performance; stream processing; latency
DOI
10.1145/3629526.3645036
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Distributed stream processing frameworks help build scalable and reliable applications that perform transformations and aggregations on continuous data streams. This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks. In contrast to other benchmarks, it focuses on use cases where stream processing frameworks are mainly employed for shuffling (i.e., re-distributing) data records to perform state-local aggregations, while the actual aggregation logic is treated as a black-box software component. ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform and adopts benchmarking metrics and methods for latency, throughput, and scalability established in the performance engineering research community. Although inspired by a real-world observability use case, it is highly configurable to allow domain-independent evaluations. ShuffleBench comes as ready-to-use open-source software that utilizes existing Kubernetes tooling and provides implementations for four state-of-the-art frameworks. Therefore, we expect ShuffleBench to be a valuable contribution to both industrial practitioners building stream processing applications and researchers working on new stream processing approaches. We complement this paper with an experimental performance evaluation that employs ShuffleBench with various configurations on Flink, Hazelcast, Kafka Streams, and Spark in a cloud-native environment. Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
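The shuffle-then-aggregate pattern named in the abstract can be illustrated with a minimal, single-process Python sketch. It is not taken from ShuffleBench: the function names (`partition_for`, `shuffle`, `aggregate_partition`, `running_sum`), the CRC32-based partitioning, and the sample records are all assumptions chosen for illustration. Records are re-distributed by key so that every record with the same key lands in the same partition, and an opaque aggregator then updates per-key state that is local to that partition.

```python
import zlib
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, int]  # (key, value); a hypothetical record shape


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a deterministic hash (CRC32, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(records: Iterable[Record], num_partitions: int) -> List[List[Record]]:
    """Re-distribute records so that equal keys end up in the same partition."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def aggregate_partition(
    partition: Iterable[Record],
    aggregator: Callable[[Optional[int], int], int],
) -> Dict[str, int]:
    """Apply a black-box aggregator to per-key state held locally by one partition."""
    state: Dict[str, int] = {}
    for key, value in partition:
        state[key] = aggregator(state.get(key), value)
    return state


def running_sum(current: Optional[int], value: int) -> int:
    """Example aggregator; the shuffle step never inspects this logic."""
    return value if current is None else current + value


if __name__ == "__main__":
    records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    for index, partition in enumerate(shuffle(records, num_partitions=3)):
        print(index, aggregate_partition(partition, running_sum))
```

Because each key maps to exactly one partition, the per-partition states never overlap and can be updated without coordination between partitions, which is the property that makes the aggregation state-local.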
Pages: 2-13
Number of pages: 12