BlueDBM: Distributed Flash Storage for Big Data Analytics

Cited by: 15
Authors
Jun, Sang-Woo [1]
Liu, Ming [1]
Lee, Sungjin [2, 6]
Hicks, Jamey [3, 7]
Ankcorn, John [3, 4]
King, Myron [3, 8]
Xu, Shuotao [1]
Arvind [5]
Affiliations
[1] MIT, Stata Ctr, 32-G836,32 Vassar St, Cambridge, MA 02139 USA
[2] MIT, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[3] Quanta Res Cambridge, Cambridge, MA USA
[4] MIT, Stata Ctr, 32-G870,32 Vassar St, Cambridge, MA USA
[5] MIT, Stata Ctr, 32-G866,32 Vassar St, Cambridge, MA USA
[6] Inha Univ, Room 1010,High Tech Bldg,100 Inharo, Incheon, South Korea
[7] Accelerated Tech Inc, Cambridge, MA USA
[8] 38 Ashland St, Arlington, MA 02476 USA
Source
ACM TRANSACTIONS ON COMPUTER SYSTEMS | 2016 / Vol. 34 / Issue 03
Keywords
Wireless sensor networks; media access control; multichannel; radio interference; time synchronization;
DOI
10.1145/2898996
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data, and daily Twitter feeds, where the datasets of interest are 5TB to 20TB. For such a dataset, one would need a cluster with 100 servers, each with 128GB to 256GB of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. However, currently available off-the-shelf flash storage packaged as SSDs does not make effective use of flash storage because it incurs a great amount of additional overhead during flash device management and network access. In this article, we present BlueDBM, a new system architecture that has flash-based storage with in-store processing capability and a low-latency high-throughput intercontroller network between storage devices. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a DRAM-centric system falls sharply even if only 5% to 10% of the references are to secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost/performance tradeoff for Big Data analytics.
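The DRAM sizing in the abstract can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration, not part of the article: the helper names `servers_needed` and `cluster_capacity_tb` are hypothetical, decimal units (1 TB = 1000 GB) are assumed, and software overhead and replication are ignored.

```python
# Back-of-the-envelope check of the DRAM sizing quoted in the abstract.
# Assumption: decimal units (1 TB = 1000 GB); the abstract does not specify.
import math

GB_PER_TB = 1000


def servers_needed(dataset_tb: float, dram_gb_per_server: float) -> int:
    """Minimum number of servers whose combined DRAM holds the dataset.

    Ignores OS/runtime overhead and replication (an assumption).
    """
    return math.ceil(dataset_tb * GB_PER_TB / dram_gb_per_server)


def cluster_capacity_tb(servers: int, dram_gb_per_server: float) -> float:
    """Aggregate DRAM of the cluster, in TB."""
    return servers * dram_gb_per_server / GB_PER_TB


# Per-server DRAM sizes and the largest dataset size named in the abstract.
for dram_gb in (128, 256):
    capacity = cluster_capacity_tb(100, dram_gb)
    needed = servers_needed(20, dram_gb)
    print(f"{dram_gb} GB/server: 100 servers hold {capacity} TB; "
          f"a 20 TB dataset needs at least {needed} servers")
```

Under these assumptions, 100 servers provide roughly 12.8 TB to 25.6 TB of aggregate DRAM, which is the same order as the 5TB to 20TB datasets the abstract describes.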
Pages: 31
Related papers
50 in total
  • [41] A Parallel and Distributed Radial Basis Function Network for Big Data Analytics
    Kamaruddin, S. K.
    Ravi, Vadlamani
    [J]. PROCEEDINGS OF THE 2019 IEEE REGION 10 CONFERENCE (TENCON 2019): TECHNOLOGY, KNOWLEDGE, AND SOCIETY, 2019 : 395 - 399
  • [42] Big Data Distributed Storage and Processing Case Studies
    Islam, Tariqul
    Abid, Mehedi Hasan
    [J]. THIRD INTERNATIONAL CONFERENCE ON IMAGE PROCESSING AND CAPSULE NETWORKS (ICIPCN 2022), 2022, 514 : 826 - 837
  • [43] Boafft: Distributed Deduplication for Big Data Storage in the Cloud
    Luo, Shengmei
    Zhang, Guangyan
    Wu, Chengwen
    Khan, Samee U.
    Li, Keqin
    [J]. IEEE TRANSACTIONS ON CLOUD COMPUTING, 2020, 8 (04) : 1199 - 1211
  • [44] Storage Solution: A Virtual Distributed Storage And Migration Architecture For Big Data
    Oluwarotimi, Randle
    Fezile, Matsebula
    Tranos, Zuva
    [J]. PROCEEDINGS OF THE 2017 2ND JOINT INTERNATIONAL INFORMATION TECHNOLOGY, MECHANICAL AND ELECTRONIC ENGINEERING CONFERENCE (JIMEC 2017), 2017, 62 : 260 - 264
  • [45] BIG DATA AND LEARNING ANALYTICS IN HIGHER EDUCATION Demystifying Variety, Acquisition, Storage, NLP and Analytics
    Alblawi, Amal S.
    Alhamed, Ahmad A.
    [J]. 2017 IEEE CONFERENCE ON BIG DATA AND ANALYTICS (ICBDA), 2017 : 124 - 129
  • [46] Real-time Big Data Analytics for Multimedia Transmission and Storage
    Wang, Kun
    Mi, Jun
    Xu, Chenhan
    Shu, Lei
    Deng, Der-Jiunn
    [J]. 2016 IEEE/CIC INTERNATIONAL CONFERENCE ON COMMUNICATIONS IN CHINA (ICCC), 2016
  • [47] Catalina: In-Storage Processing Acceleration for Scalable Big Data Analytics
    Torabzadehkashi, Mahdi
    Rezaei, Siavash
    Heydarigorji, Ali
    Bobarshad, Hosein
    Alves, Vladimir
    Bagherzadeh, Nader
    [J]. 2019 27TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP), 2019 : 430 - 437
  • [48] Vortex: A Stream-oriented Storage Engine For Big Data Analytics
    Edara, Pavan
    Forbes, Jonathan
    Li, Bigang
    [J]. COMPANION OF THE 2024 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, SIGMOD-COMPANION 2024, 2024 : 175 - 187
  • [49] Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services
    Chrimes, Dillon
    Zamani, Hamid
    [J]. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE, 2017, 2017
  • [50] Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics
    Ramakrishnan, Raghu
    Sridharan, Baskar
    Douceur, John R.
    Kasturi, Pavan
    Krishnamachari-Sampath, Balaji
    Krishnamoorthy, Karthick
    Li, Peng
    Manu, Mitica
    Michaylov, Spiro
    Ramos, Rogerio
    Sharman, Neil
    Xu, Zee
    Barakat, Youssef
    Douglas, Chris
    Draves, Richard
    Naidu, Shrikant S.
    Shastry, Shankar
    Sikaria, Atul
    Sun, Simon
    Venkatesan, Ramarathnam
    [J]. SIGMOD'17: PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2017 : 51 - 63