Optimizing Near-Data Processing for Spark

Authors
Rachuri, Sri Pramodh [1]
Gantasala, Arun [1]
Emanuel, Prajeeth [1]
Gandhi, Anshul [1]
Foley, Robert [2]
Puhov, Peter [2]
Gkountouvas, Theodoros [3]
Lei, Hui [3]
Affiliations
[1] SUNY Stony Brook, Stony Brook, NY 11794 USA
[2] FutureWei, Santa Clara, CA USA
[3] OpenInfra Labs, London, England
Funding
US National Science Foundation
Keywords
resource disaggregation; near-data processing; Spark; pushdown; modeling
DOI
10.1109/ICDCS54860.2022.00067
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Resource disaggregation (RD) is an emerging paradigm for data center computing whereby resource-optimized servers are employed to minimize resource fragmentation and improve resource utilization. Apache Spark deployed under the RD paradigm employs a cluster of compute-optimized servers to run executors and a cluster of storage-optimized servers to host the data on HDFS. However, the network transfer from storage to compute cluster becomes a severe bottleneck for big data processing. Near-data processing (NDP) is a concept that aims to alleviate network load in such cases by offloading (or "pushing down") some of the compute tasks to the storage cluster. Employing NDP for Spark under the RD paradigm is challenging because storage-optimized servers have limited computational resources and cannot host the entire Spark processing stack. Further, even if such a lightweight stack could be developed and deployed on the storage cluster, it is not entirely obvious which Spark queries would benefit from pushdown, and which tasks of a given query should be pushed down to storage. This paper presents the design and implementation of a near-data processing system for Spark, SparkNDP, that aims to address the aforementioned challenges. SparkNDP works by implementing novel NDP Spark capabilities on the storage cluster using a lightweight library of SQL operators and then developing an analytical model to help determine which Spark tasks should be pushed down to storage based on the current network and system state. Simulation and prototype implementation results show that SparkNDP can help reduce Spark query execution times when compared to both the default approach of not pushing down any tasks to storage and the outright NDP approach of pushing all tasks to storage.
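As a rough illustration of the pushdown decision described in the abstract, the sketch below compares an estimated task time with and without pushdown. It is not the analytical model from the paper: the two cost formulas, the `ClusterState` fields, and the function names are all hypothetical, and the sketch assumes a task is dominated by one network transfer and one scan whose output size is `selectivity` times its input size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical snapshot of the current network and system state."""
    network_bytes_per_s: float   # available storage-to-compute bandwidth
    compute_bytes_per_s: float   # scan rate of one task on a compute server
    storage_bytes_per_s: float   # scan rate of one task on a storage server


def time_without_pushdown(input_bytes: float, state: ClusterState) -> float:
    # Assumption: ship the full input over the network, then process it
    # on the compute cluster.
    transfer = input_bytes / state.network_bytes_per_s
    process = input_bytes / state.compute_bytes_per_s
    return transfer + process


def time_with_pushdown(input_bytes: float, selectivity: float,
                       state: ClusterState) -> float:
    # Assumption: process the full input on the (slower) storage server,
    # then ship only the reduced output over the network.
    process = input_bytes / state.storage_bytes_per_s
    transfer = selectivity * input_bytes / state.network_bytes_per_s
    return process + transfer


def should_push_down(input_bytes: float, selectivity: float,
                     state: ClusterState) -> bool:
    """Push the task down only if the estimate says it finishes sooner."""
    return (time_with_pushdown(input_bytes, selectivity, state)
            < time_without_pushdown(input_bytes, state))


if __name__ == "__main__":
    congested = ClusterState(network_bytes_per_s=1e9,
                             compute_bytes_per_s=1e10,
                             storage_bytes_per_s=2e9)
    fast_network = ClusterState(network_bytes_per_s=1e11,
                                compute_bytes_per_s=1e10,
                                storage_bytes_per_s=2e9)
    # Same 10 GB task keeping 10% of its input, under two network states.
    print(should_push_down(1e10, 0.1, congested))
    print(should_push_down(1e10, 0.1, fast_network))
```

Under these assumed numbers, the same task is worth pushing down when the network is the bottleneck and not worth it when the network is fast and the storage server's limited compute dominates, which is the trade-off the abstract describes between no pushdown and pushing everything down.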
Pages: 636-646
Number of pages: 11