Parallel membership queries on very large scientific data sets using bitmap indexes

Cited by: 10
Authors
Yildiz, Beytullah [1]
Wu, Kesheng [1]
Byna, Suren [1]
Shoshani, Arie [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Computat Res Div, Mail Stop 50B-3238,1 Cyclotron Rd, Berkeley, CA 94720 USA
Keywords
big data; bitmap index; data management; membership query; parallel query; scientific data
DOI
10.1002/cpe.5157
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging because the data are often stored in files with specific formats, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called a membership query, which identifies whether the elements of a queried set appear among the values of an attribute. Attributes that naturally take classification values appear frequently in scientific domains, such as category and object type, as well as in daily life, such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low-cardinality attributes but also for high-cardinality attributes, such as floating-point variables, electric charge, or momentum in a particle physics data set, thanks to compression algorithms such as Word-Aligned Hybrid. We conducted experiments, in a highly parallelized environment, on data obtained from a particle accelerator model and on a synthetic data set.
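To make the idea of a bitmap-indexed membership query concrete, the following is a minimal sketch, not the authors' implementation. It assumes an equality-encoded index in which each distinct attribute value owns one bitmap, stored here as a plain Python integer without any compression (so Word-Aligned Hybrid is not shown) and without parallelization. The function names `build_bitmap_index` and `membership_query` and the sample column `particle_type` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List


def build_bitmap_index(column: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct value to an integer whose bit i is set when row i holds that value."""
    index: Dict[Hashable, int] = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)


def membership_query(index: Dict[Hashable, int], wanted: Iterable[Hashable]) -> List[int]:
    """Return the row numbers whose attribute value is one of `wanted`."""
    hits = 0
    for value in wanted:
        # OR the per-value bitmaps; a value absent from the index contributes nothing.
        hits |= index.get(value, 0)
    rows: List[int] = []
    row = 0
    while hits:
        if hits & 1:
            rows.append(row)
        hits >>= 1
        row += 1
    return rows


# Hypothetical categorical attribute with six rows.
particle_type = ["electron", "muon", "photon", "electron", "pion", "muon"]
index = build_bitmap_index(particle_type)
matching_rows = membership_query(index, {"electron", "pion"})
print(matching_rows)  # [0, 3, 4]
```

The query never scans the raw column: it touches only the bitmaps of the queried values and combines them with a bitwise OR, which is why the approach suits discrete classification attributes.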
Pages: 15