Distributed Data Strategies to Support Large-Scale Data Analysis Across Geo-Distributed Data Centers

Cited by: 21
Authors
Emara, Tamer Z. [1 ,2 ,3 ]
Huang, Joshua Zhexue [1 ,2 ]
Affiliations
[1] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[3] Higher Inst Engn & Technol Kafrelsheikh, Kafrelsheikh 33514, Egypt
Source
IEEE Access | 2020 / Vol. 8
Funding
National Natural Science Foundation of China
Keywords
Data centers; Big Data; Data models; Task analysis; Data analysis; Distributed databases; Companies; Big data analysis; cloud data centers; distributed computing; random sample partition; wide area analytics; DATA REPLICATION; JOBS;
DOI
10.1109/ACCESS.2020.3027675
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As the volume of data grows rapidly, storing big data in a single data center is no longer feasible. Hence, companies have adopted two scenarios for storing their big data in multiple data centers. In the first scenario, the company's big data are distributed across multiple data centers without data replication. In the second scenario, data are also stored in multiple data centers, but important data are replicated across these data centers to increase data safety and availability. In both scenarios, however, analyzing big data distributed across multiple data centers becomes a challenging task. In this paper, we propose two data distribution strategies to support big data analysis across geo-distributed data centers. In these strategies, we use the recent Random Sample Partition data model to convert big data into sets of random sample data blocks and distribute these data blocks to multiple data centers, either without replication or with replication. In the first strategy, without replication, we randomly select samples of data blocks from multiple data centers and download the sample data blocks to one data center for analysis. In the second strategy, with replication of data blocks, we can analyze big data in any data center by randomly selecting a sample of data blocks replicated from other data centers. This strategy avoids data transfer between data centers. We demonstrate the performance of the two strategies in big data analysis using simulation results produced on one local data center and four AWS data centers in North America, Asia, and Australia.
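The two strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the round-robin placement, the data center labels, and the use of a mean estimate as the analysis task are all assumptions made for illustration, and a plain shuffle-and-split stands in for the Random Sample Partition model.

```python
import random
from typing import Dict, List, Sequence


def make_rsp_blocks(records: Sequence[float], num_blocks: int, seed: int = 0) -> List[List[float]]:
    """Shuffle the records and cut them into near-equal blocks.

    Assumption: a shuffle-and-split stands in for Random Sample Partition,
    so each block is a random sample of the whole data set.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def distribute(num_blocks: int, centers: Sequence[str]) -> Dict[str, List[int]]:
    """Strategy without replication: each block id lives in exactly one center.

    Round-robin placement is a hypothetical choice, not taken from the paper.
    """
    placement: Dict[str, List[int]] = {c: [] for c in centers}
    for block_id in range(num_blocks):
        placement[centers[block_id % len(centers)]].append(block_id)
    return placement


def replicate(placement: Dict[str, List[int]], copies: int, seed: int = 0) -> Dict[str, List[int]]:
    """Strategy with replication: each center also holds a random sample
    of block ids that originate in the other centers."""
    rng = random.Random(seed)
    result: Dict[str, List[int]] = {}
    for center, own in placement.items():
        remote = [b for c, ids in placement.items() if c != center for b in ids]
        result[center] = own + rng.sample(remote, min(copies, len(remote)))
    return result


def sample_across_centers(placement: Dict[str, List[int]], per_center: int, seed: int = 0) -> List[int]:
    """Pick a few block ids at random from every center; in the strategy
    without replication these blocks would be downloaded to one site."""
    rng = random.Random(seed)
    chosen: List[int] = []
    for ids in placement.values():
        chosen.extend(rng.sample(ids, min(per_center, len(ids))))
    return chosen


def estimate_mean(blocks: List[List[float]], block_ids: Sequence[int]) -> float:
    """Estimate the global mean from the selected blocks only."""
    values = [v for b in block_ids for v in blocks[b]]
    return sum(values) / len(values)


blocks = make_rsp_blocks(list(range(10_000)), num_blocks=20)
centers = ["dc-a", "dc-b", "dc-c", "dc-d"]  # hypothetical data center names
placement = distribute(len(blocks), centers)
picked = sample_across_centers(placement, per_center=1)
print("estimate from 4 of 20 blocks:", estimate_mean(blocks, picked))
```

Because every block is itself a random sample, an estimate computed from a few blocks should approximate the value computed on the full data set, which is why only a sample of blocks needs to be moved or replicated.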
Pages: 178526-178538
Page count: 13