Distributed Data Strategies to Support Large-Scale Data Analysis Across Geo-Distributed Data Centers

被引:21
|
作者
Emara, Tamer Z. [1 ,2 ,3 ]
Huang, Joshua Zhexue [1 ,2 ]
机构
[1] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[3] Higher Inst Engn & Technol Kafrelsheikh, Kafrelsheikh 33514, Egypt
来源
IEEE ACCESS | 2020年 / 8卷
基金
中国国家自然科学基金;
关键词
Data centers; Big Data; Data models; Task analysis; Data analysis; Distributed databases; Companies; Big data analysis; cloud data centers; distributed computing; random sample partition; wide area analytics; DATA REPLICATION; JOBS;
D O I
10.1109/ACCESS.2020.3027675
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
As the volume of data grows rapidly, storing big data in a single data center is no longer feasible. Hence, companies have developed two scenarios to store their big data in multiple data centers. In the first scenario, the company's big data are distributed in multiple data centers without data replication. In the second scenario, data are also stored in multiple data centers but important data are replicated in these data centers to increase data safety and availability. However, in these scenarios, analyzing big data distributed in multiple data centers becomes a challenging task. In this paper, we propose two data distribution strategies to support big data analysis across geo-distributed data centers. In these strategies, we use the recent Random Sample Partition data model to convert big data into sets of random sample data blocks and distribute these data blocks into multiple data centers either without replication or with replication. In analyzing big data in multiple data centers without replication, we randomly select samples of data blocks from multiple data centers and download the sample data blocks to one data center for analysis. In the second strategy with replication of data blocks, we can analyze big data on any data center by randomly selecting a sample of data blocks replicated from other data centers. This strategy avoids data transformation between data centers. We demonstrate the performance of the two strategies in big data analysis by using simulation results produced on one local data center and four AWS data centers in North America, Asia, and Australia.
引用
收藏
页码:178526 / 178538
页数:13
相关论文
共 50 条
  • [41] Reducing the expenses of geo-distributed data centers with portable containerized modules
    Brocanelli, Marco
    Zheng, Wenli
    Wang, Xiaorui
    [J]. PERFORMANCE EVALUATION, 2014, 79 : 104 - 119
  • [42] Time- and Cost- Efficient Task Scheduling across Geo-Distributed Data Centers
    Hu, Zhiming
    Li, Baochun
    Luo, Jun
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2018, 29 (03) : 705 - 718
  • [43] Flutter: Scheduling Tasks Closer to Data Across Geo-Distributed Datacenters
    Hu, Zhiming
    Li, Baochun
    Luo, Jun
    [J]. IEEE INFOCOM 2016 - THE 35TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS, 2016,
  • [44] Efficient Geo-Distributed Data Processing with Rout
    Jayalath, Chamikara
    Eugster, Patrick
    [J]. 2013 IEEE 33RD INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS), 2013, : 470 - 480
  • [45] Low Latency Geo-distributed Data Analytics
    Pu, Qifan
    Ananthanarayanan, Ganesh
    Bodik, Peter
    Kandula, Srikanth
    Akella, Aditya
    Bahl, Paramvir
    Stoica, Ion
    [J]. ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2015, 45 (04) : 421 - 434
  • [46] Compliant Geo-distributed Data Processing in Action
    Beedkar, Kaustubh
    Brekardin, David
    Quiane-Ruiz, Jorge-Anulfo
    Markl, Volker
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2021, 14 (12): : 2843 - 2846
  • [47] Low Latency Geo-distributed Data Analytics
    Pu, Qifan
    Ananthanarayanan, Ganesh
    Bodik, Peter
    Kandula, Srikanth
    Akella, Aditya
    Bahl, Paramvir
    Stoica, Ion
    [J]. SIGCOMM'15: PROCEEDINGS OF THE 2015 ACM CONFERENCE ON SPECIAL INTEREST GROUP ON DATA COMMUNICATION, 2015, : 421 - 434
  • [48] Datum: Managing Data Purchasing and Data Placement in a Geo-Distributed Data Market
    Ren, Xiaoqi
    London, Palma
    Ziani, Juba
    Wierman, Adam
    [J]. IEEE-ACM TRANSACTIONS ON NETWORKING, 2018, 26 (02) : 893 - 905
  • [49] Privacy Regulation Aware Process Mapping in Geo-Distributed Cloud Data Centers
    Zhou, Amelie Chi
    Xiao, Yao
    Gong, Yifan
    He, Bingsheng
    Zhai, Jidong
    Mao, Rui
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (08) : 1872 - 1888
  • [50] Renewable Energy-Aware Big Data Analytics in Geo-Distributed Data Centers with Reinforcement Learning
    Xu, Chenhan
    Wang, Kun
    Li, Peng
    Xia, Rui
    Guo, Song
    Guo, Minyi
    [J]. IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, 2020, 7 (01): : 205 - 215