In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2, 3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2, 3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP;
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Subject classification code
081202
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250-264
Page count: 15