Data Analytics in the Cloud with Flexible MapReduce Workflows

Cited by: 0
Authors
Goncalves, Carlos [1 ,2 ]
Assuncao, Luis [1 ,2 ]
Cunha, Jose C. [2 ]
Affiliations
[1] Univ Nova Lisboa, Inst Super Engn Lisboa, P-1200 Lisbon, Portugal
[2] Univ Nova Lisboa, Fac Ciencias Tecnol, Dept Informat, CITI, P-1200 Lisbon, Portugal
Keywords
MapReduce; Workflow; Text Mining; Cloud; MAP-REDUCE;
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Data analytic applications are characterized by large data sets that are subject to a series of processing phases. Some of these phases are executed sequentially, but others can be executed concurrently or in parallel on clusters, grids or clouds. The MapReduce programming model has been applied to process large data sets in cluster and cloud environments. Developing an application using MapReduce requires installing, configuring and accessing specific frameworks such as Apache Hadoop or Elastic MapReduce in the Amazon Cloud. It would be desirable to provide more flexibility in adjusting such configurations according to the application characteristics. Furthermore, the composition of the multiple phases of a data analytic application requires the specification of all the phases and their orchestration. The original MapReduce model and environment lack flexible support for such configuration and composition. Recognizing that scientific workflows have been successfully applied to modeling complex applications, this paper describes our experiments on implementing MapReduce as sub-workflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). A text mining data analytic application is modeled as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. As in typical MapReduce environments, the end user only needs to define the application algorithms for input data processing and for the map and reduce functions. In the paper we present experimental results when using the AWARD framework to execute MapReduce workflows deployed over multiple Amazon EC2 (Elastic Compute Cloud) instances.
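As a rough illustration of the division of labor the abstract describes, where the end user supplies only the map and reduce functions, the following is a minimal in-process word-count sketch in Python. It is not the AWARD API or the authors' implementation: the names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical, and the driver runs sequentially in one process rather than as workflow nodes on EC2 instances.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(document: str) -> Iterator[Tuple[str, int]]:
    """User-defined map function (hypothetical): emit (word, 1) per word."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User-defined reduce function (hypothetical): sum the counts of a word."""
    return word, sum(counts)


def run_mapreduce(
    inputs: Iterable[str],
    map_fn: Callable[[str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy sequential driver: map every input, group by key, then reduce.

    A real framework would distribute the map and reduce steps across
    nodes; this stand-in only shows the data flow between the phases.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)  # shuffle step: group values by key
    return dict(reduce_fn(key, values) for key, values in groups.items())


# Example input for a text mining style word count.
docs = ["cloud data analytics", "data workflows in the cloud"]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
```

Here `word_counts` maps each distinct word to its total number of occurrences across `docs`. Keeping the driver separate from the two user functions mirrors the idea that orchestration can change (a workflow engine, Hadoop, a single process) without touching the application logic.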
Pages: 8
Related papers
50 in total
  • [31] Cloud Supported Building Data Analytics
    Petri, Ioan
    Rana, Omer
    Rezgui, Yacine
    Li, Haijiang
    Beach, Tom
    Zou, Mengsong
    Diaz-Montes, Javier
    Parashar, Manish
    2014 14TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2014, : 641 - 650
  • [32] Data Analytics using Cloud Computing
    Maheshwari, Prakhar
    Singhal, Alankar
    Qadeer, Mohammed A.
    2017 9TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMMUNICATION NETWORKS (CICN), 2017, : 82 - 87
  • [33] Performance models of data parallel DAG workflows for large scale data analytics
    Shi, Juwei
    Lu, Jiaheng
    DISTRIBUTED AND PARALLEL DATABASES, 2023, 41 (03) : 299 - 329
  • [34] Performance models of data parallel DAG workflows for large scale data analytics
    Juwei Shi
    Jiaheng Lu
    Distributed and Parallel Databases, 2023, 41 : 299 - 329
  • [35] Performance Models of Data Parallel DAG Workflows for Large Scale Data Analytics
    Shi, Juwei
    Lu, Jiaheng
    2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDEW 2021), 2021, : 104 - 111
  • [36] A Flexible Qualitative Data Analytics Dashboard
    Chua, Gim Guan
    Lim, Paul Min Chim
    Mak, Mun Thye
    Ng, Wee Siong
    Guo, Shuqiao
    Chan, Ang Loon
    Liang, Desmond Chua Zhen
    PROCEEDINGS OF TENCON 2018 - 2018 IEEE REGION 10 CONFERENCE, 2018, : 1865 - 1869
  • [37] Study on Cloud Storage based on the MapReduce for Big Data
    Huang Yi
    Ma Xinqiang
    Zhang Yongdan
    Liu Youyuan
    PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON MECHATRONICS, ELECTRONIC, INDUSTRIAL AND CONTROL ENGINEERING, 2015, 8 : 1601 - 1605
  • [38] Smart Intermediate Data Transfer for MapReduce on Cloud Computing
    Huang, Tzu-Chi
    Chu, Kuo-Chih
    Rao, Yu-Ruei
    2013 INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA (CLOUDCOM-ASIA), 2013, : 9 - 14
  • [39] Scientific data processing using MapReduce in cloud environments
    Kong, Xiangsheng
    Information Technology Journal, 2013, 12 (23) : 7869 - 7873
  • [40] Running Data-Intensive Scientific Workflows in the Cloud
    Sato, Chiaki
    Leslie, Luke M.
    Lee, Young Choon
    Zomaya, Albert Y.
    Ranjan, Rajiv
    2014 15TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT 2014), 2014, : 180 - 185