Tuplex: Data Science in Python at Native Code Speed

Cited by: 10
Authors
Spiegelberg, Leonhard [1]
Yesantharao, Rahul [2]
Schwarzkopf, Malte [1]
Kraska, Tim [2]
Affiliations
[1] Brown Univ, Providence, RI 02912 USA
[2] MIT CSAIL, Cambridge, MA USA
DOI
10.1145/3448016.3457244
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Today's data science pipelines often rely on user-defined functions (UDFs) written in Python. But interpreted Python code is slow, and Python UDFs cannot be compiled to machine code easily. We present Tuplex, a new data analytics framework that just-in-time compiles developers' natural Python UDFs into efficient, end-to-end optimized native code. Tuplex introduces a novel dual-mode execution model that compiles an optimized fast path for the common case, and falls back on slower exception code paths for data that fail to match the fast path's assumptions. Dual-mode execution is crucial to making end-to-end optimizing compilation tractable: by focusing on the common case, Tuplex keeps the code simple enough to apply aggressive optimizations. Thanks to dual-mode execution, Tuplex pipelines always complete even if exceptions occur, and Tuplex's post-facto exception handling simplifies debugging. We evaluate Tuplex with data science pipelines over real-world datasets. Compared to Spark and Dask, Tuplex improves end-to-end pipeline runtime by 5-91x and comes within 1.1-1.7x of a hand-optimized C++ baseline. Tuplex outperforms other Python compilers by 6x and competes with prior, more limited query compilers. Optimizations enabled by dual-mode processing improve runtime by up to 3x, and Tuplex performs well in a distributed setting.
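The dual-mode execution idea in the abstract can be illustrated with a minimal pure-Python sketch. This is not Tuplex's implementation or API: `run_dual_mode`, `fast_double`, and `general_double` are hypothetical names introduced here, and nothing is compiled. The sketch only shows the control flow, assuming a fast path specialized for the common case (integer rows), a slower general path for rows that violate that assumption, and post-facto reporting of rows that fail both paths so the pipeline still completes.

```python
from typing import Any, Callable, Iterable, List, Tuple


def run_dual_mode(
    rows: Iterable[Any],
    fast_path: Callable[[Any], Any],
    general_path: Callable[[Any], Any],
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Run fast_path on every row; retry exceptional rows on general_path.

    Returns the results in input order plus the rows that failed both paths.
    """
    results: List[Tuple[int, Any]] = []
    pending: List[Tuple[int, Any]] = []

    # Phase 1: the common case. Any row that breaks the fast path's
    # assumptions is set aside instead of aborting the pipeline.
    for index, row in enumerate(rows):
        try:
            results.append((index, fast_path(row)))
        except Exception:
            pending.append((index, row))

    # Phase 2: exceptional rows go through the slower, more general path.
    failed: List[Tuple[Any, Exception]] = []
    for index, row in pending:
        try:
            results.append((index, general_path(row)))
        except Exception as exc:
            # Reported after the run rather than raised mid-pipeline.
            failed.append((row, exc))

    results.sort(key=lambda pair: pair[0])  # restore input order
    return [value for _, value in results], failed


def fast_double(x: Any) -> int:
    # Hypothetical fast path: assumes the row is already an int.
    if type(x) is not int:
        raise TypeError("fast path expects int")
    return x * 2


def general_double(x: Any) -> int:
    # Hypothetical general path: accepts anything int() can convert.
    return int(x) * 2


if __name__ == "__main__":
    out, failed = run_dual_mode([1, 2, "3", None, 4], fast_double, general_double)
    print(out)
    print([(row, type(exc).__name__) for row, exc in failed])
```

In this sketch the string row is handled by the general path, while the `None` row fails both paths and is returned alongside the results for inspection after the run.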
Pages: 1718 - 1731
Page count: 14
Related papers
50 in total
  • [1] Tuplex: Robust, Efficient Analytics When Python Rules
    Spiegelberg, Leonhard F.
    Kraska, Tim
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (12) : 1958 - 1961
  • [2] Introduction to data science with Python
    Monteiro, M.
    [J]. EUROPEAN JOURNAL OF CLINICAL INVESTIGATION, 2021, 51 : 14 - 14
  • [3] Geographic Data Science With Python
    Podgorski, Krzysztof
    [J]. INTERNATIONAL STATISTICAL REVIEW, 2024, 92 (01) : 134 - 135
  • [4] The Lompe code: A Python toolbox for ionospheric data analysis
    Hovland, A. O.
    Laundal, K. M.
    Reistad, J. P.
    Hatch, S. M.
    Walker, S. J.
    Madelaire, M.
    Ohma, A.
    [J]. FRONTIERS IN ASTRONOMY AND SPACE SCIENCES, 2022, 9
  • [5] PyDaQu: Python Data Quality Code Generation Based on Data Architecture
    Abughazala, Moamin
    Muccini, Henry
    Qadri, Khitam
    [J]. 2023 ACM/IEEE INTERNATIONAL CONFERENCE ON MODEL DRIVEN ENGINEERING LANGUAGES AND SYSTEMS COMPANION, MODELS-C, 2023 : 60 - 64
  • [6] Python Code and Illustrative Crisis Management Data from Twitter
    Wang, Yen-Yao
    Wang, Tawei
    [J]. JOURNAL OF INFORMATION SYSTEMS, 2022, 36 (03) : 211 - 217
  • [7] Making Python Code Idiomatic by Automatic Refactoring Non-idiomatic Python Code with Pythonic Idioms
    Zhang, Zejun
    Xing, Zhenchang
    Xia, Xin
    Xu, Xiwei
    Zhu, Liming
    [J]. PROCEEDINGS OF THE 30TH ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2022, 2022 : 696 - 708
  • [8] Python in the NERSC Exascale Science Applications Program for Data
    Ronaghi, Zahra
    Thomas, Rollin
    Deslippe, Jack
    Bailey, Stephen
    Gursoy, Doga
    Kisner, Theodore
    Keskitalo, Reijo
    Borrill, Julian
    [J]. PROCEEDINGS OF PYHPC'17: 7TH WORKSHOP ON PYTHON FOR HIGH-PERFORMANCE AND SCIENTIFIC COMPUTING, 2017
  • [9] Applications of Python to evaluate environmental data science problems
    Kadiyala, Akhil
    Kumar, Ashok
    [J]. ENVIRONMENTAL PROGRESS & SUSTAINABLE ENERGY, 2017, 36 (06) : 1580 - 1586
  • [10] Detecting Memory Errors in Python Native Code by Tracking Object Lifecycle with Reference Count
    Ma, Xutong
    Yan, Jiwei
    Zhang, Hao
    Yan, Jun
    Zhang, Jian
    [J]. 2023 38TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE, 2023 : 1429 - 1440