Rethinking Data Management for Big Data Scientific Workflows

Cited by: 0
Authors
Vahi, Karan [1]
Rynge, Mats [1]
Juve, Gideon [1]
Mayani, Rajiv [1]
Deelman, Ewa [1]
Affiliations
[1] University of Southern California, Information Sciences Institute, Marina del Rey, CA 90292, USA
Keywords
Pegasus; workflows; object stores; Pegasus Lite; data staging site; data management; cloud
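DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract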
Scientific workflows consist of tasks that operate on input data to generate new data products that are used by subsequent tasks. Workflow management systems typically stage data to computational sites before invoking the necessary computations. In some cases, data may be accessed using remote I/O. These approaches have limitations, however. First, the storage at a computational site may be limited and unable to accommodate the necessary input and intermediate data. Second, even if there is enough storage, it is sometimes managed by a filesystem with limited scalability. In recent years, object stores have been shown to provide a scalable way to store and access large datasets; however, they provide a limited set of operations (retrieve, store and delete) that do not always match the requirements of the workflow tasks. In this paper, we show how scientific workflows can take advantage of the capabilities of object stores without requiring users to modify their workflow-based applications or scientific codes. We present two general approaches: one that exclusively uses object stores to store all the files accessed and generated by a workflow, and another that relies on the shared filesystem for caching intermediate datasets. We have implemented both of these approaches in the Pegasus Workflow Management System and have used them to execute workflows in a variety of execution environments, ranging from traditional supercomputing environments that have a shared filesystem to dynamic environments like Amazon AWS and the Open Science Grid that only offer remote object stores. As a result, Pegasus users can easily migrate their applications from a shared filesystem deployment to one using object stores without changing their application codes.
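
The idea of running unmodified codes against a store that only supports retrieve, store and delete can be illustrated with a minimal Python sketch. It is not the Pegasus implementation: `InMemoryObjectStore`, `run_with_staging` and `uppercase_task` are hypothetical names introduced here, and the store is an in-memory stand-in for a real remote service. The wrapper retrieves inputs into a local scratch directory, runs the task against ordinary files, and stores the outputs back.

```python
import os
import tempfile
from typing import Callable, Dict, List


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store: retrieve, store, delete only."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def store(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def retrieve(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        del self._objects[key]


def run_with_staging(
    store: InMemoryObjectStore,
    task: Callable[[str], None],
    inputs: List[str],
    outputs: List[str],
) -> None:
    """Stage inputs in, run the task on local files, stage outputs out."""
    with tempfile.TemporaryDirectory() as workdir:
        # Stage in: copy each input object to a local file.
        for name in inputs:
            with open(os.path.join(workdir, name), "wb") as f:
                f.write(store.retrieve(name))

        # The task sees only a local directory, never the object store.
        task(workdir)

        # Stage out: push each declared output back to the object store.
        for name in outputs:
            with open(os.path.join(workdir, name), "rb") as f:
                store.store(name, f.read())


def uppercase_task(workdir: str) -> None:
    """Example 'scientific code' that reads and writes plain local files."""
    with open(os.path.join(workdir, "in.txt")) as src:
        text = src.read()
    with open(os.path.join(workdir, "out.txt"), "w") as dst:
        dst.write(text.upper())
```

In this sketch the task never calls the store directly, so the same task function would work equally well if the wrapper read from a shared filesystem instead.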
Pages: 9