PaPar: A Parallel Data Partitioning Framework for Big Data Applications

Cited by: 3
Authors
Wang, Hao [1]
Zhang, Jing [1]
Zhang, Da [1]
Pumma, Sarunya [1]
Feng, Wu-chun [1]
Affiliation
[1] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Keywords
Partition; Skew; Big Data; MapReduce; MPI
DOI
10.1109/IPDPS.2017.119
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Today, big data applications can generate large-scale data sets at an unprecedented rate, and scientists have turned to parallel and distributed systems for data analysis. Although many big data processing systems provide advanced mechanisms to partition data and tackle computational skew, it is difficult to implement skew-resistant mechanisms efficiently, because the runtime of each partition depends not only on the input data size but also on the algorithms that will be applied to the data. As a result, many research efforts have explored user-defined partitioning methods for different types of applications and algorithms. However, manually writing application-specific partitioning methods requires significant coding effort, and finding the optimal data partitioning strategy is particularly challenging even for developers who have sufficient application knowledge. In this paper, we propose PaPar, a Parallel data Partitioning framework for big data applications, to simplify the implementation of data partitioning algorithms. PaPar provides a set of computational operators and distribution strategies for programmers to describe the desired data partitioning methods. Taking an input data configuration file and a workflow configuration file as input, PaPar automatically generates the parallel partitioning code by formalizing the user-defined workflow as a sequence of key-value operations and matrix-vector multiplications and efficiently mapping it to parallel implementations with MPI and MapReduce. We apply our approach to two applications: muBLAST, an MPI implementation of BLAST algorithms for biological sequence search; and PowerLyra, a computation and partitioning method for skewed graphs. The experimental results show that, compared to the applications' own partitioning methods, the code generated by PaPar produces the same data partitions with comparable or less partitioning time.
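As a rough illustration of what "a sequence of key-value operations" followed by a distribution strategy can look like, the sketch below sorts records by an estimated cost and then compares a cyclic distribution with a block distribution. It is not PaPar's interface or generated code: the function names, the (id, cost) record layout, and the sample data are all hypothetical, and it runs serially rather than on MPI or MapReduce.

```python
from typing import List, Tuple

# Hypothetical record layout: (record id, estimated cost, e.g. sequence length).
Record = Tuple[str, int]


def sort_by_cost(records: List[Record]) -> List[Record]:
    """Key-value operation: order records by cost, largest first."""
    return sorted(records, key=lambda r: r[1], reverse=True)


def distribute_cyclic(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Distribution strategy: deal records out round-robin."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return partitions


def distribute_block(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Distribution strategy: give each partition one contiguous chunk."""
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(num_partitions)]


def partition_costs(partitions: List[List[Record]]) -> List[int]:
    """Total estimated cost per partition, used here as a skew indicator."""
    return [sum(cost for _, cost in part) for part in partitions]


# Made-up sample data, for illustration only.
records = [("a", 90), ("b", 10), ("c", 80), ("d", 20), ("e", 70), ("f", 30)]
ordered = sort_by_cost(records)
cyclic = distribute_cyclic(ordered, 2)
block = distribute_block(ordered, 2)

print("cyclic costs:", partition_costs(cyclic))
print("block costs: ", partition_costs(block))
```

On this sample the cyclic distribution leaves the two partitions closer in total cost than the block distribution does, which is the kind of skew the abstract refers to. Real cost also depends on the algorithm applied to each partition, so the cost column here is only a stand-in.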
Pages: 605-614
Page count: 10
Related papers
50 in total
  • [31] Dache: A data aware caching for big-data applications using the MapReduce framework
    [J]. Zhao, Y. (yaxiongzhao@google.com), 1600, Tsinghua University (19):
  • [32] Sentiment Analysis of Big Data Applications using Twitter Data with the Help of HADOOP Framework
    Sehgal, Divya
    Agarwal, Ambuj Kumar
    [J]. PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON SYSTEM MODELING & ADVANCEMENT IN RESEARCH TRENDS (SMART-2016), 2016, : 251 - 255
  • [33] Design and Construction of a Big Data Analytics Framework for Health Applications
    Kuo, Mu-Hsing
    Chrimes, Dillon
    Moa, Belaid
    Hu, Wei
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON SMART CITY/SOCIALCOM/SUSTAINCOM (SMARTCITY), 2015, : 631 - 636
  • [34] Dione: A Framework for Automatic Profiling and Tuning Big Data Applications
    Zacheilas, Nikos
    Maroulis, Stathis
    Priovolos, Thanasis
    Kalogeraki, Vana
    Gunopulos, Dimitrios
    [J]. 2018 IEEE 34TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2018, : 1637 - 1640
  • [35] A Framework for Scheduling and Managing Big Data Applications in a Distributed Infrastructure
    Govindarajan, Kannan
    Somasundaram, Thamarai Selvi
    Boulanger, David
    Kumar, Vivekanandan Suresh
    Kinshuk
    [J]. 2015 SEVENTH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING (ICOAC), 2015,
  • [36] A Hyperbolic Space Analytics Framework for Big Network Data and Their Applications
    Stai, Eleni
    Karyotis, Vasileios
    Papavassiliou, Symeon
    [J]. IEEE NETWORK, 2016, 30 (01): : 11 - 17
  • [37] Parallel SLINK for big data
    Poonam Goyal
    Sonal Kumari
    Sumit Sharma
    Sundar Balasubramaniam
    Navneet Goyal
    [J]. International Journal of Data Science and Analytics, 2020, 9 : 339 - 359
  • [38] Parallel SLINK for big data
    Goyal, Poonam
    Kumari, Sonal
    Sharma, Sumit
    Balasubramaniam, Sundar
    Goyal, Navneet
    [J]. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2020, 9 (03) : 339 - 359
  • [39] Applications for Big Data
    Rine, Christine M.
    [J]. HEALTH & SOCIAL WORK, 2024,
  • [40] A framework that focuses on the data in big data governance
    [J]. IBM Data Management Magazine, 2012, (01):