In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2, 3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2, 3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations even when working on moderately large data sets. We believe that there is plenty of room for improvement by looking at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we expand on the initial concept by introducing a cost model for evaluating these patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250-264
Page count: 15