A Comparison of Approaches to Large-Scale Data Analysis

Cited by: 0
Authors
Pavlo, Andrew [1 ]
Paulson, Erik
Rasin, Alexander [1 ]
Abadi, Daniel J.
DeWitt, David J.
Madden, Samuel
Stonebraker, Michael
Affiliations
[1] Brown Univ, Providence, RI 02912 USA
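Source
ACM SIGMOD/PODS 2009 CONFERENCE | 2009
Funding
National Science Foundation (USA);
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract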
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
Pages: 165 - 178
Page count: 14
Related papers
50 in total
  • [31] Large-scale data analysis using the Wigner function
    Earnshaw, R. A.
    Lei, C.
    Li, J.
    Mugassabi, S.
    Vourdas, A.
    PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, 2012, 391 (07) : 2401 - 2407
  • [32] Large-Scale Data Analysis Using Heuristic Methods
    Dzemyda, Gintautas
    Sakalauskas, Leonidas
    INFORMATICA, 2011, 22 (01) : 1 - 10
  • [33] Computational solutions to large-scale data management and analysis
    Schadt, Eric E.
    Linderman, Michael D.
    Sorenson, Jon
    Lee, Lawrence
    Nolan, Garry P.
    NATURE REVIEWS GENETICS, 2010, 11 (09) : 647 - 657
  • [34] Rational choice theory and large-scale data analysis
    Weakliem, DL
    CONTEMPORARY SOCIOLOGY-A JOURNAL OF REVIEWS, 1999, 28 (02) : 246 - 247
  • [35] The HaLoop approach to large-scale iterative data analysis
    Bu, Yingyi
    Howe, Bill
    Balazinska, Magdalena
    Ernst, Michael D.
    VLDB JOURNAL, 2012, 21 (02) : 169 - 190
  • [36] Efficient large-scale data analysis using mapreduce
    Kubo, R., 1600, Nippon Telegraph and Telephone Corp. (10):
  • [37] Computational solutions to large-scale data management and analysis
    Eric E. Schadt
    Michael D. Linderman
    Jon Sorenson
    Lawrence Lee
    Garry P. Nolan
    Nature Reviews Genetics, 2010, 11 : 647 - 657
  • [38] Exploratory data analysis in large-scale genetic studies
    Teo, Yik Y.
    BIOSTATISTICS, 2010, 11 (01) : 70 - 81
  • [39] Large-Scale Collaborative Analysis and Extraction of Web Data
    Weigel, Felix
    Panda, Biswanath
    Riedewald, Mirek
    Gehrke, Johannes
    Calimlim, Manuel
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (02) : 1476 - 1479
  • [40] Large-Scale Analysis of Genetic and Clinical Patient Data
    Ritchie, Marylyn D.
    ANNUAL REVIEW OF BIOMEDICAL DATA SCIENCE, VOL 1, 2018, 1 : 263 - 274