In-depth analysis on parallel processing patterns for high-performance Dataframes

Cited by: 0
Authors
Perera, Niranda [1]
Sarker, Arup Kumar [2, 3]
Staylor, Mills [2]
von Laszewski, Gregor [3]
Shan, Kaiying [2]
Kamburugamuve, Supun [1]
Widanage, Chathura [1]
Abeykoon, Vibhatha [1]
Kanewela, Thejaka Amila [1]
Fox, Geoffrey [2, 3]
Affiliations
[1] Indiana Univ Alumni, Bloomington, IN 47405 USA
[2] Univ Virginia, Charlottesville, VA 22904 USA
[3] Univ Virginia, Biocomplex Inst & Initiat, Charlottesville, VA 22904 USA
Keywords
Dataframes; High-performance computing; Data engineering; Relational algebra; MPI; Distributed Memory Parallel; MODEL; OPTIMIZATION; ALGORITHMS; LOGP;
DOI
10.1016/j.future.2023.07.007
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations even when working on moderately large data sets. We believe that there is plenty of room for improvement by looking at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we expand on the initial concept by introducing a cost model for evaluating these patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.
Pages: 250-264
Number of pages: 15