Improving Parallelism in Data-Intensive Workflows with Distributed Databases

Cited by: 0
Authors
Watanabe, Elaine Naomi [1]
Braghetto, Kelly Rosa [1]
Affiliations
[1] Univ Sao Paulo, Inst Math & Stat, Dept Comp Sci, Rua Matao 1010, BR-05508090 Sao Paulo, SP, Brazil
Keywords
CHALLENGES; SCIENCE; SYSTEM
DOI
10.1109/SCC.2018.00034
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The efficient execution of data-intensive workflows relies on strategies that enable parallel data processing, such as partitioning and replicating data across distributed resources. The maximum degree of parallelism a workflow can reach during its execution is usually defined at design time. However, designing workflow models capable of using distributed computing platforms efficiently is not a simple task and requires specialized expertise. Furthermore, since Workflow Management Systems treat workflow activities as black boxes, they cannot automatically exploit data parallelism in the workflow execution. To address this problem, we propose a novel method to automatically improve data parallelism in workflows based on annotations that characterize how activities access and consume data. For an annotated workflow model, the method defines a model transformation and a database setup (including data sharding, replication, and indexing) to support data parallelism in a distributed environment. To evaluate this approach, we implemented and tested two workflows that process up to 20.5 million data objects from real-world datasets. We executed each model in 21 different scenarios in a cluster on a public cloud, using a centralized relational database and a distributed NoSQL database. The automatic parallelization created by the proposed method reduced the execution times of these workflows by up to 66.6%, without increasing the monetary costs of their execution.
Pages: 209 - 216
Number of pages: 8
Related papers
50 in total
  • [1] Optimizing Distributed Data-Intensive Workflows
    Friese, Ryan D.
    Tallent, Nathan R.
    Schram, Malachi
    Halappanavar, Mahantesh
    Barker, Kevin J.
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2018: 279 - 289
  • [2] Data throttling for data-intensive workflows
    Park, Sang-Min
    Humphrey, Marty
    [J]. 2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008: 1796 - 1806
  • [3] Boosting Performance of Data-intensive Analysis Workflows with Distributed Coordinated Caching
    Heidecker, C.
    von Cube, R. F.
    Giffels, M.
    Quast, G.
    Sauter, M.
    Schnepf, M. J.
    [J]. 19TH INTERNATIONAL WORKSHOP ON ADVANCED COMPUTING AND ANALYSIS TECHNIQUES IN PHYSICS RESEARCH, 2020, 1525
  • [4] XML database support for distributed execution of data-intensive scientific workflows
    Hastings, S
    Ribeiro, M
    Langella, S
    Oster, S
    Catalyurek, U
    Pan, T
    Huang, K
    Ferreira, R
    Saltz, J
    Kurc, T
    [J]. SIGMOD RECORD, 2005, 34 (03) : 50 - 55
  • [5] Improving the energy efficiency and performance of data-intensive workflows in virtualized clouds
    Xilong Qu
    Peng Xiao
    Lirong Huang
    [J]. The Journal of Supercomputing, 2018, 74 : 2935 - 2955
  • [6] Improving the energy efficiency and performance of data-intensive workflows in virtualized clouds
    Qu, Xilong
    Xiao, Peng
    Huang, Lirong
    [J]. JOURNAL OF SUPERCOMPUTING, 2018, 74 (07): 2935 - 2955
  • [7] Experiences with workflows for automating data-intensive bioinformatics
    Spjuth, Ola
    Bongcam-Rudloff, Erik
    Hernandez, Guillermo Carrasco
    Forer, Lukas
    Giovacchini, Mario
    Guimera, Roman Valls
    Kallio, Aleksi
    Korpelainen, Eija
    Kandula, Maciej M.
    Krachunov, Milko
    Kreil, David P.
    Kulev, Ognyan
    Labaj, Pawel P.
    Lampa, Samuel
    Pireddu, Luca
    Schonherr, Sebastian
    Siretskiy, Alexey
    Vassilev, Dimitar
    [J]. BIOLOGY DIRECT, 2015, 10
  • [8] Running Data-Intensive Scientific Workflows in the Cloud
    Sato, Chiaki
    Leslie, Luke M.
    Lee, Young Choon
    Zomaya, Albert Y.
    Ranjan, Rajiv
    [J]. 2014 15TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT 2014), 2014: 180 - 185
  • [9] Data Management Challenges of Data-Intensive Scientific Workflows
    Deelman, Ewa
    Chervenak, Ann
    [J]. CCGRID 2008: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, VOLS 1 AND 2, PROCEEDINGS, 2008: 687 - 692
  • [10] Toward efficient execution of data-intensive workflows
    Sukhoroslov, Oleg
    [J]. JOURNAL OF SUPERCOMPUTING, 2021, 77 (08): 7989 - 8012