Parallel Query Evaluation as a Scientific Data Service

Cited by: 0
Authors
Dong, Bin [1]
Byna, Surendra [1]
Wu, Kesheng [1]
Affiliation
[1] Lawrence Berkeley Natl Lab, Computat Res Div, Berkeley, CA 94720 USA
Keywords
Scientific Data Services; Parallel Query Processing
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Scientific experiments and simulations produce mountains of data in file formats such as HDF5, NetCDF, and FITS. Often, a relatively small amount of data holds the key to new scientific insight. Locating that critical information in these large files is challenging because existing solutions need significant user involvement in preparing the data, generating indexes, and answering queries. Data management systems that support querying, such as SciDB, require a costly process of loading data from scientific data formats into these systems. The search results also need to be converted back to a format needed by the subsequent data analysis and visualization tools. These steps are time-consuming, tedious, and possibly error-prone. Toward providing efficient data management directly on these scientific file formats, we introduce a framework called Scientific Data Services (SDS). SDS aims to provide efficient data management optimizations as services. In this paper, we introduce the design and implementation of one such service, the parallel querying service. To answer queries efficiently, we transparently augment user data with bitmap indexes and ordered datasets. We design the querying service to manage these augmented datasets and to redirect queries automatically to bitmap indexes or to ordered datasets based on their availability and the expected query response time. The generation of bitmap indexes and sorted datasets, as well as querying, is parallelized to work on large supercomputers. We show that SDS achieves 22X, 55X, and 62X speedups over a conventional full-scan approach of sifting through the data in answering three queries from a plasma physics analysis application.
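The redirection idea in the abstract can be illustrated with a minimal, serial, in-memory sketch. It is not the SDS implementation: the names `full_scan`, `BinnedBitmapIndex`, `SortedCopy`, `answer`, and the `est_cost` table are hypothetical, the bitmaps are plain Python integers rather than compressed on-disk indexes, and the cost numbers are placeholders standing in for whatever response-time estimate the service actually uses. The sketch answers a range query `lo <= v < hi` by picking the cheapest plan among those that are available.

```python
from bisect import bisect_left, bisect_right


def full_scan(values, lo, hi):
    """Baseline plan: test every element against lo <= v < hi."""
    return [i for i, v in enumerate(values) if lo <= v < hi]


class BinnedBitmapIndex:
    """One bitmap (a Python int) per bin; bit i is set when row i falls in the bin."""

    def __init__(self, values, edges):
        # Assumption: edges are sorted; bin k covers [edges[k], edges[k + 1]).
        self.edges = list(edges)
        self.bitmaps = [0] * (len(self.edges) - 1)
        for i, v in enumerate(values):
            k = bisect_right(self.edges, v) - 1
            if 0 <= k < len(self.bitmaps):  # values outside all bins are skipped
                self.bitmaps[k] |= 1 << i

    def query(self, lo, hi):
        # Exact only when lo and hi coincide with bin edges (checked by the caller).
        k_lo, k_hi = self.edges.index(lo), self.edges.index(hi)
        bits = 0
        for k in range(k_lo, k_hi):
            bits |= self.bitmaps[k]  # OR together the bitmaps of the covered bins
        return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


class SortedCopy:
    """Ordered copy of the data, keeping the original row number of each value."""

    def __init__(self, values):
        pairs = sorted((v, i) for i, v in enumerate(values))
        self.keys = [v for v, _ in pairs]
        self.rows = [i for _, i in pairs]

    def query(self, lo, hi):
        # Two binary searches bound the matching run in the sorted keys.
        start = bisect_left(self.keys, lo)
        stop = bisect_left(self.keys, hi)
        return sorted(self.rows[start:stop])


def answer(values, lo, hi, bitmap=None, sorted_copy=None, est_cost=None):
    """Run the cheapest available plan; returns (plan name, matching row numbers)."""
    plans = {"scan": lambda: full_scan(values, lo, hi)}
    if bitmap is not None and lo in bitmap.edges and hi in bitmap.edges:
        plans["bitmap"] = lambda: bitmap.query(lo, hi)
    if sorted_copy is not None:
        plans["sorted"] = lambda: sorted_copy.query(lo, hi)
    # Hypothetical placeholder costs; a real service would estimate response time.
    est_cost = est_cost or {"scan": 3, "bitmap": 2, "sorted": 1}
    best = min(plans, key=lambda name: est_cost[name])
    return best, plans[best]()


values = [0.5, 2.5, 1.5, 3.5, 2.0, 0.1]
index = BinnedBitmapIndex(values, edges=[0, 1, 2, 3, 4])
ordered = SortedCopy(values)
```

Calling `answer(values, 1, 3, bitmap=index)` uses the bitmap plan, while `answer(values, 1, 3)` falls back to the full scan because no augmented dataset is available. All three plans return the same row numbers; only the amount of data touched differs.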
Pages: 194-202
Number of pages: 9