In-depth analysis on parallel processing patterns for high-performance Dataframes

Times cited: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2, 3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2, 3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Data science has expanded enormously in both research and industry over the past decade, driven largely by the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are adding complexity to data engineering applications, which are now integrated into data processing pipelines that process terabytes of data. A significant share of pipeline time is typically spent on data preprocessing, so improving its efficiency directly improves overall pipeline performance. The community has recently embraced Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) hit performance limits even on moderately large data sets. We believe there is considerable room for improvement in approaching this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we expand on that initial concept by introducing a cost model for evaluating these patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250-264
Page count: 15