Petabyte scale data mining: Dream or reality?

Cited by: 6
Authors
Szalay, AS [1 ]
Gray, J [1 ]
Vandenberg, J [1 ]
Affiliations
[1] Johns Hopkins Univ, Dept Phys & Astron, Baltimore, MD 21218 USA
Keywords
data mining; large-scale computing; databases; spatial statistics
DOI
10.1117/12.461427
Chinese Library Classification
P1 [Astronomy]
Discipline classification code
0704
Abstract
Science is becoming very data intensive. Today's astronomy datasets with tens of millions of galaxies already present substantial challenges for data mining (1). In less than 10 years the catalogs are expected to grow to billions of objects, and image archives will reach Petabytes. Imagine having a 100 GB database in 1996, when disk scanning speeds were 30 MB/s, and database tools were immature. Such a task today is trivial, almost manageable with a laptop. We think that the issue of a PB database will be very similar in six years. In this paper we scale our current experiments in data archiving and analysis on the Sloan Digital Sky Survey (2,3) data six years into the future. We analyze these projections and look at the requirements of performing data mining on such data sets. We conclude that the task scales rather well: we could do the job today, although it would be expensive. There do not seem to be any show-stoppers that would prevent us from storing and using a Petabyte dataset six years from today.
Pages: 333-338
Page count: 6
Related papers
50 in total
  • [1] Mining in the Arctic, a dream or a reality?
    Hermansen, Robert
    [J]. Glückauf: Die Fachzeitschrift für Rohstoff, Bergbau und Energie, 2003, 139 (06): 329 - 333
  • [2] Hive - A Petabyte Scale Data Warehouse Using Hadoop
    Thusoo, Ashish
    Sen Sarma, Joydeep
    Jain, Namit
    Shao, Zheng
    Chakka, Prasad
    Zhang, Ning
    Antony, Suresh
    Liu, Hao
    Murthy, Raghotham
    [J]. 26TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING ICDE 2010, 2010: 996 - 1005
  • [3] A new petabyte-scale data derivation framework for ATLAS
    Catmore, James
    Cranshaw, Jack
    Gillam, Thomas
    Gramstad, Eirik
    Laycock, Paul
    Ozturk, Nurcan
    Stewart, Graeme Andrew
    [J]. 21ST INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY AND NUCLEAR PHYSICS (CHEP2015), PARTS 1-9, 2015, 664
  • [4] BBM: Bayesian Browsing Model from Petabyte-scale Data
    Liu, Chao
    Guo, Fan
    Faloutsos, Christos
    [J]. KDD-09: 15TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2009: 537 - 545
  • [5] NOAA Open Data Dissemination: Petabyte-scale Earth system data in the cloud
    Willett, Denis S.
    Brannock, Jonathan
    Dissen, Jenny
    Keown, Patrick
    Szura, Katelyn
    Brown, Otis B.
    Simonson, Adrienne
    [J]. SCIENCE ADVANCES, 2023, 9 (38)
  • [6] Data Caching for Enterprise-Grade Petabyte-Scale OLAP
    Tang, Chunxu
    Fan, Bin
    Zhao, Jing
    Liang, Chen
    Wang, Yi
    Wang, Beinan
    Qiu, Ziyue
    Qiu, Lu
    Ding, Bowen
    Sun, Shouzhuo
    Che, Saiguang
    Mai, Jiaming
    Chen, Shouwei
    Zhu, Yu
    Xie, Jianjian
    Sun, Yutian
    Li, Yao
    Zhang, Yangjun
    Wang, Ke
    Chen, Mingmin
    [J]. PROCEEDINGS OF THE 2024 USENIX ANNUAL TECHNICAL CONFERENCE, ATC 2024, 2024: 901 - 915
  • [7] Metocean Data Services: Dream or Reality?
    Wyatt, Paul
    [J]. SEA TECHNOLOGY, 2013, 54 (05) : 7 - 7
  • [8] Hyper Dimension Shuffle: Efficient Data Repartition at Petabyte Scale in SCOPE
    Qiao, Shi
    Nicoara, Adrian
    Sun, Jin
    Friedman, Marc
    Patel, Hiren
    Ekanayake, Jaliya
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (10): 1113 - 1125
  • [9] Realizing Petabyte Scale Acoustic Modeling
    Parthasarathi, Sree Hari Krishnan
    Sivakrishnan, Nitin
    Ladkat, Pranav
    Strom, Nikko
    [J]. IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2019, 9 (02) : 422 - 432
  • [10] REALITY MINING: A NEW SUBFIELD OF DATA MINING
    Berka, Petr
    [J]. IDIMT-2016: INFORMATION TECHNOLOGY, SOCIETY AND ECONOMY STRATEGIC CROSS-INFLUENCES, 2016, 45: 259 - 266