In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2,3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2,3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we expand on the initial concept by introducing a cost model for evaluating these patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250-264
Page count: 15