Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System

Cited by: 4
Authors
Petersohn, Devin [1]
Tang, Dixin [1]
Durrani, Rehan [1]
Melik-Adamyan, Areg [2]
Gonzalez, Joseph E. [1]
Joseph, Anthony D. [1]
Parameswaran, Aditya G. [1]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] Intel, Santa Clara, CA USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2021 / Volume 15 / Issue 03
Keywords
DREMEL;
DOI
10.14778/3494124.3494152
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Dataframes have become universally popular as a means to represent data in various stages of structure and to manipulate it using a rich set of operators, thereby becoming an essential tool in the data scientists' toolbox. However, dataframe systems, such as pandas, scale poorly and are non-interactive on moderate to large datasets. We discuss our experiences developing MODIN, our first cut at a parallel dataframe system, which already has users across several industries and over 1M downloads. MODIN translates pandas functions into a core set of operators that are individually parallelized via columnar, row-wise, or cell-wise decomposition rules that we formalize in this paper. We also introduce metadata independence to allow metadata, such as order and type, to be decoupled from the physical representation and maintained lazily. Using rule-based decomposition and metadata independence, along with careful engineering, MODIN is able to support pandas operations across both rows and columns on very large dataframes, unlike Koalas and Dask DataFrames, which either break down or are unable to support such operations, while also being much faster than pandas.
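The three decomposition rules named in the abstract can be illustrated with a small sketch. This is not MODIN's implementation or API: it uses only the Python standard library, represents a table as a list of rows, and the helper names (split, map_cellwise, reduce_columns, filter_rows) are hypothetical, chosen only to show how an operator's access pattern could determine the way a table is partitioned for parallel work.

```python
from concurrent.futures import ThreadPoolExecutor


def split(seq, n):
    """Split seq into at most n contiguous chunks (hypothetical helper)."""
    size = max(1, -(-len(seq) // n))  # ceiling division, at least 1
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def map_cellwise(table, fn, workers=2):
    # Cell-wise rule: every cell is independent, so any partitioning is
    # valid; this sketch simply uses blocks of rows.
    with ThreadPoolExecutor(workers) as pool:
        blocks = list(pool.map(
            lambda block: [[fn(x) for x in row] for row in block],
            split(table, workers)))
    return [row for block in blocks for row in block]


def reduce_columns(table, fn, workers=2):
    # Columnar rule: each output value needs one whole column, so the
    # table is partitioned into groups of columns.
    columns = list(zip(*table))
    with ThreadPoolExecutor(workers) as pool:
        parts = list(pool.map(
            lambda cols: [fn(c) for c in cols],
            split(columns, workers)))
    return [value for part in parts for value in part]


def filter_rows(table, pred, workers=2):
    # Row-wise rule: each decision needs one whole row, so the table is
    # partitioned into groups of rows; row order is preserved.
    with ThreadPoolExecutor(workers) as pool:
        parts = list(pool.map(
            lambda block: [row for row in block if pred(row)],
            split(table, workers)))
    return [row for part in parts for row in part]


table = [[1, 2], [3, 4], [5, 6]]
scaled = map_cellwise(table, lambda x: x * 10)
totals = reduce_columns(table, sum)
kept = filter_rows(table, lambda row: row[0] > 1)
```

Threads are used here only to keep the sketch short and portable; a real system would dispatch partitions to separate processes or machines.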
Pages: 739-751
Page count: 13