Parallel membership queries on very large scientific data sets using bitmap indexes

Cited by: 10
Authors
Yildiz, Beytullah [1]
Wu, Kesheng [1]
Byna, Suren [1]
Shoshani, Arie [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Computat Res Div, Mail Stop 50B-3238,1 Cyclotron Rd, Berkeley, CA 94720 USA
Source
Keywords
big data; bitmap index; data management; membership query; parallel query; scientific data
DOI
10.1002/cpe.5157
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging because the data are often stored in files with specific formats, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, which identifies whether queried elements of a set are members of an attribute. Attributes that naturally have classification values appear frequently in scientific domains, such as category and object type, as well as in daily life, such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low-cardinality attributes but also for high-cardinality attributes, such as floating-point variables, electric charge, or momentum in a particle physics data set, thanks to compression algorithms such as Word-Aligned Hybrid. We conducted experiments in a highly parallelized environment on data obtained from a particle accelerator model and on a synthetic data set.
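The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sample `particle_types` column, and the chunk-per-worker partitioning are all assumptions made for illustration. Bitmaps are stored as plain Python integers and are left uncompressed, so the Word-Aligned Hybrid compression mentioned above is not modeled, and the thread pool only demonstrates how a query can be split across partitions rather than delivering real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def membership_query(index, wanted):
    """OR together the bitmaps of all wanted values; absent values contribute nothing."""
    result = 0
    for value in wanted:
        result |= index.get(value, 0)
    return result


def matching_rows(bitmap, offset=0):
    """Turn a result bitmap into a sorted list of row numbers, shifted by offset."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(offset + row)
        bitmap >>= 1
        row += 1
    return rows


def parallel_membership_query(values, wanted, chunk_size, workers=4):
    """Hypothetical partitioned query: index and query each chunk independently."""
    chunks = [
        (start, values[start:start + chunk_size])
        for start in range(0, len(values), chunk_size)
    ]

    def query_chunk(item):
        start, chunk = item
        local_index = build_bitmap_index(chunk)
        return matching_rows(membership_query(local_index, wanted), offset=start)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(query_chunk, chunks))  # map preserves chunk order
    return [row for part in parts for row in part]


# Made-up categorical column standing in for an "object type" attribute.
particle_types = ["electron", "muon", "photon", "electron", "pion", "muon"]
index = build_bitmap_index(particle_types)
hits = matching_rows(membership_query(index, {"electron", "pion"}))
parallel_hits = parallel_membership_query(
    particle_types, {"electron", "pion"}, chunk_size=2
)
```

Because each queried value maps to one precomputed bitmap, the query reduces to a handful of bitwise OR operations instead of a scan over the raw column, and the per-chunk results can be concatenated in chunk order without further merging.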
Pages: 15
Related papers
50 in total
  • [41] Evaluating Mixed Patterns on Large Data Graphs Using Bitmap Views
    Wu, Xiaoying
    Theodoratos, Dimitri
    Skoutas, Dimitrios
    Lan, Michael
    [J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2019), PT I, 2019, 11446 : 553 - 570
  • [42] PARALLEL PROCESSING OF LARGE DATA SETS IN PARTICLE PHYSICS
    Rotaru, Marina
    Ciubancan, Mihai
    Stoicea, Gabriel
    [J]. ROMANIAN JOURNAL OF PHYSICS, 2016, 61 (1-2) : 245 - 252
  • [43] CLUSTERING VERY LARGE DATA SETS USING A LOW MEMORY MATRIX FACTORED REPRESENTATION
    Littau, David
    Boley, Daniel
    [J]. COMPUTATIONAL INTELLIGENCE, 2009, 25 (02) : 114 - 135
  • [44] Finding approximate solutions to combinatorial problems with very large data sets using BIRCH
    Harrington, Justin
    Salibian-Barrera, Matias
    [J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2010, 54 (03) : 655 - 667
  • [45] Enabling Ad Hoc Queries over Low-Level Scientific Data Sets
    Chiu, David
    Agrawal, Gagan
    [J]. SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, PROCEEDINGS, 2009, 5566 : 218 - 236
  • [46] Declustering large multidimensional data sets for range queries over heterogeneous disks
    Lee, J
    Winslett, M
    Ma, XS
    Yu, SK
    [J]. SSDBM 2002: 15TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 2003 : 212 - 221
  • [47] USING MAXDIFF FOR EVALUATING VERY LARGE SETS OF ITEMS
    Wirth, Ralph
    Wolfrath, Anette
    [J]. PROCEEDINGS OF THE SAWTOOTH SOFTWARE CONFERENCE 2012, 2012 : 59 - 78
  • [48] Selective sampling for approximate clustering of very large data sets
    Wang, Liang
    Bezdek, James C.
    Leckie, Christopher
    Kotagiri, Ramamohanarao
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2008, 23 (03) : 313 - 331
  • [49] Fixed rank kriging for very large spatial data sets
    Cressie, Noel
    Johannesson, Gardar
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 2008, 70 : 209 - 226
  • [50] A Geometric Approach to Train SVM on Very Large Data Sets
    Zeng, Zhi-Qiang
    Xu, Hua-Rong
    Xie, Yan-Qi
    Gao, Ji
    [J]. 2008 3RD INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEM AND KNOWLEDGE ENGINEERING, VOLS 1 AND 2, 2008, : 991 - +