PaPar: A Parallel Data Partitioning Framework for Big Data Applications

Cited by: 3
Authors
Wang, Hao [1]
Zhang, Jing [1]
Zhang, Da [1]
Pumma, Sarunya [1]
Feng, Wu-chun [1]
Affiliation
[1] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
Keywords
Partition; Skew; Big Data; MapReduce; MPI
DOI
10.1109/IPDPS.2017.119
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Today, big data applications can generate large-scale data sets at an unprecedented rate, and scientists have turned to parallel and distributed systems for data analysis. Although many big data processing systems provide advanced mechanisms to partition data and tackle computational skew, it is difficult to implement skew-resistant mechanisms efficiently, because the runtime of different partitions depends not only on the input data size but also on the algorithms that will be applied to the data. As a result, many research efforts have explored user-defined partitioning methods for different types of applications and algorithms. However, manually writing application-specific partitioning methods requires significant coding effort, and finding the optimal data partitioning strategy is particularly challenging even for developers who have sufficient application knowledge. In this paper, we propose PaPar, a Parallel data Partitioning framework for big data applications, to simplify the implementation of data partitioning algorithms. PaPar provides a set of computational operators and distribution strategies for programmers to describe the desired data partitioning methods. Taking an input data configuration file and a workflow configuration file as input, PaPar automatically generates the parallel partitioning code by formalizing the user-defined workflow as a sequence of key-value operations and matrix-vector multiplications, and by efficiently mapping it to parallel implementations with MPI and MapReduce. We apply our approach to two applications: muBLAST, an MPI implementation of the BLAST algorithms for biological sequence search; and PowerLyra, a computation and partitioning method for skewed graphs. The experimental results show that, compared to the applications' own partitioning methods, the code generated by PaPar produces the same data partitions with comparable or less partitioning time.
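As a rough illustration of what "describing a partitioning method as a sequence of key-value operations" could look like, the sketch below chains one operator (sort by a key) with one distribution strategy (round-robin dealing). It is not PaPar's API or generated code: the function names, the choice of sort-then-cyclic distribution, and the use of record length as the key are all assumptions made for this example, and it runs serially rather than on MPI or MapReduce.

```python
from collections import defaultdict

# Hypothetical operator: order records by a user-chosen key, largest first.
def sort_by_key(records, key):
    return sorted(records, key=key, reverse=True)

# Hypothetical distribution strategy: deal records round-robin so that
# large and small records are spread across all partitions.
def distribute_cyclic(records, num_partitions):
    partitions = defaultdict(list)
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return [partitions[p] for p in range(num_partitions)]

# A "workflow" here is simply the composition of the two steps above.
def partition(records, num_partitions, key=len):
    return distribute_cyclic(sort_by_key(records, key), num_partitions)

# Toy input: sequences of very different lengths (an assumed stand-in
# for skewed input data).
sequences = ["ACGTACGTAC", "ACG", "ACGTAC", "A", "ACGTACGT", "AC"]
parts = partition(sequences, 2)
for p, recs in enumerate(parts):
    print(p, recs, sum(len(r) for r in recs))
```

Dealing sorted records cyclically is only one simple way to limit size skew between partitions; as the abstract notes, runtime also depends on the algorithm applied to each partition, so equal sizes do not by themselves guarantee balanced work.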
Pages: 605-614
Page count: 10