Data Analytics in the Cloud with Flexible MapReduce Workflows

Cited by: 0
Authors
Goncalves, Carlos [1, 2]
Assuncao, Luis [1, 2]
Cunha, Jose C. [2]
Affiliations
[1] Univ Nova Lisboa, Inst Super Engn Lisboa, P-1200 Lisbon, Portugal
[2] Univ Nova Lisboa, Fac Ciencias Tecnol, Dept Informat, CITI, P-1200 Lisbon, Portugal
Keywords
MapReduce; Workflow; Text Mining; Cloud; MAP-REDUCE;
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Data analytic applications are characterized by large data sets that are subject to a series of processing phases. Some of these phases are executed sequentially, but others can be executed concurrently or in parallel on clusters, grids or clouds. The MapReduce programming model has been applied to process large data sets in cluster and cloud environments. Developing an application using MapReduce requires installing, configuring and accessing specific frameworks such as Apache Hadoop or Elastic MapReduce in the Amazon Cloud. It would be desirable to provide more flexibility in adjusting such configurations according to the application characteristics. Furthermore, the composition of the multiple phases of a data analytic application requires the specification of all the phases and their orchestration. The original MapReduce model and environment lack flexible support for such configuration and composition. Recognizing that scientific workflows have been successfully applied to modeling complex applications, this paper describes our experiments on implementing MapReduce as sub-workflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). A text mining data analytic application is modeled as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. As in typical MapReduce environments, the end user only needs to define the application algorithms for input data processing and for the map and reduce functions. In the paper we present experimental results when using the AWARD framework to execute MapReduce workflows deployed over multiple Amazon EC2 (Elastic Compute Cloud) instances.
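To illustrate the three user-defined pieces the abstract mentions (input data processing, map, and reduce), the following is a minimal Python sketch of a word-count job, a common text mining building block. It does not use the AWARD framework, Hadoop, or Elastic MapReduce; the function names (`split_input`, `map_fn`, `reduce_fn`, `run_mapreduce`) are hypothetical, and the thread pool only stands in for distributed workers under that assumption.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(text, n_parts):
    """Input processing (hypothetical): split the text into n_parts groups of lines."""
    lines = text.splitlines()
    return [lines[i::n_parts] for i in range(n_parts)]


def map_fn(lines):
    """Map: emit a (word, 1) pair for every word in one partition."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for a single word."""
    return word, sum(counts)


def run_mapreduce(text, n_parts=2):
    """Run split -> map -> shuffle -> reduce locally; threads stand in for workers."""
    partitions = split_input(text, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        mapped = list(pool.map(map_fn, partitions))

    # Shuffle: group the emitted values by key before reducing.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_mapreduce("cloud data cloud\nworkflow data cloud"))
```

In a workflow setting such as the one the abstract describes, each of these steps would presumably correspond to one or more workflow nodes; the sketch only shows the data flow between them, not how any particular framework schedules it.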
Pages: 8
Related papers
50 in total
  • [21] A data placement strategy in scientific cloud workflows
    Yuan, Dong
    Yang, Yun
    Liu, Xiao
    Chen, Jinjun
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2010, 26 (08): 1200 - 1214
  • [22] Optimal Data Placement for Scientific Workflows in Cloud
    Shrivastava, Manish
    JOURNAL OF COMPUTER INFORMATION SYSTEMS, 2024, 64 (04) : 501 - 517
  • [23] Texera: A System for Collaborative and Interactive Data Analytics Using Workflows
    Wang, Zuozhi
    Huang, Yicong
    Ni, Shengquan
    Kumar, Avinash
    Alsudais, Sadeem
    Liu, Xiaozhen
    Lin, Xinyuan
    Ding, Yunyan
    Li, Chen
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17 (11): 3580 - 3588
  • [24] Addressing Scientific Rigor in Data Analytics Using Semantic Workflows
    Erickson, John S.
    Sheehan, John
    Bennett, Kristin P.
    McGuinness, Deborah L.
    Provenance and Annotation of Data and Processes, IPAW 2016, 2016, 9672 : 187 - 190
  • [25] EdiFlow: data-intensive interactive workflows for visual analytics
    Benzaken, Veronique
    Fekete, Jean-Daniel
    Hemery, Pierre-Luc
    Khemiri, Wael
    Manolescu, Ioana
    IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011), 2011: 780 - 791
  • [26] Data analytics and geostatistical workflows for modeling uncertainty in unconventional reservoirs
    Pyrcz, Michael J.
    BULLETIN OF CANADIAN PETROLEUM GEOLOGY, 2019, 67 (04) : 273 - 282
  • [27] Efficient Data Analytics Over Cloud
    Gupta, Rajeev
    Gupta, Himanshu
    Mohania, Mukesh
    ADVANCES IN COMPUTERS, VOL 90: CONNECTED COMPUTING ENVIRONMENT, 2013, 90 : 367 - 401
  • [28] An Approach in Big Data Analytics to Improve the Velocity of Unstructured Data Using MapReduce
    Sundarakumar, M. R.
    Mahadevan, G.
    Somula, Ramasubbareddy
    Sennan, Sankar
    Rawal, Bharat S.
    INTERNATIONAL JOURNAL OF SYSTEM DYNAMICS APPLICATIONS, 2021, 10 (04)
  • [29] Data analytics and cloud computing technologies
    Hart's E and P, 2021, 96 (04): 48 - 49
  • [30] Serverless Data Analytics in the IBM Cloud
    Sampe, Josep
    Vernik, Gil
    Sanchez-Artigas, Marc
    Garcia-Lopez, Pedro
    MIDDLEWARE INDUSTRY'18: PROCEEDINGS OF THE 2018 ACM/IFIP/USENIX MIDDLEWARE CONFERENCE (INDUSTRIAL TRACK), 2018: 1 - 8