Parallel membership queries on very large scientific data sets using bitmap indexes

Cited by: 10

Authors:

Yildiz, Beytullah [1]

Wu, Kesheng [1]

Byna, Suren [1]

Shoshani, Arie [1]

Affiliation:

[1] Lawrence Berkeley Natl Lab, Computat Res Div, Mail Stop 50B-3238, 1 Cyclotron Rd, Berkeley, CA 94720 USA

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2019 / Vol. 31 / Issue 15

Keywords:

big data; bitmap index; data management; membership query; parallel query; scientific data

DOI:

10.1002/cpe.5157

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging as data are often stored in specific formatted files, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, to identify whether queried elements of a set are members of an attribute. Attributes that naturally have classification values appear frequently in scientific domains such as category and object type as well as in daily life such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low cardinality attributes but also for high cardinality attributes, such as floating-point variables, electric charge, or momentum in a particle physics data set, due to compression algorithms such as Word-Aligned Hybrid. We conducted experiments, in a highly parallelized environment, on data obtained from a particle accelerator model and a synthetic data set.

Pages: 15

50 records in total

[11] Knowledge and Theme Discovery across Very Large Biological Data Sets Using Distributed Queries: A Prototype Combining Unstructured and Structured Data
Mudunuri, Uma S.
Khouja, Mohamad
Repetski, Stephen
Venkataraman, Girish
Che, Anney
Luke, Brian T.
Girard, F. Pascal
Stephens, Robert M.
[J]. PLOS ONE, 2013, 8 (12):
[12] Data mining from extreme data sets: Very large and/or very skewed data sets
Hall, LO
[J]. 2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 2555 - 2555
[13] Joining very large data sets
Johnson, T
Chatziantoniou, D
[J]. DATABASES IN TELECOMMUNICATIONS, 2000, 1819 : 118 - 132
[14] Answering Approximate String Queries on Large Data Sets Using External Memory
Behm, Alexander
Li, Chen
Carey, Michael J.
[J]. IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011), 2011, : 888 - 899
[15] Efficiently Representing Membership for Variable Large Data Sets
Wei, Jiansheng
Jiang, Hong
Zhou, Ke
Feng, Dan
[J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2014, 25 (04) : 960 - 970
[16] Parallel visualization of large data sets
Rosenberg, R
Lanzagorta, M
Chtchelkanova, A
Khokhlov, A
[J]. VISUAL DATA EXPLORATION AND ANALYSIS VII, 2000, 3960 : 135 - 143
[17] Empirical modeling of very large data sets using neural networks
Owens, AJ
[J]. IJCNN 2000: PROCEEDINGS OF THE IEEE-INNS-ENNS INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOL VI, 2000, : 302 - 307
[18] Applicability of cluster validation indexes for large data sets
Santibanez, M.
Valdovinos, R. M.
Trueba, A.
Rendon, E.
Alejo, R.
Lopez, E.
[J]. 2013 12TH MEXICAN INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (MICAI 2013), 2013, : 187 - 193
[19] On the Behavior of Indexes for Imprecise Numerical Data and Necessity Measured Queries under Skewed Data Sets
Barranco, Carlos D.
Campana, Jesus R.
Medina, Juan M.
[J]. FLEXIBLE QUERY ANSWERING SYSTEMS, 2011, 7022 : 485 - +
[20] PCA and PLS with very large data sets
Kettaneh, N
Berglund, A
Wold, S
[J]. COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2005, 48 (01) : 69 - 85
