Parallel membership queries on very large scientific data sets using bitmap indexes

Cited by: 10

Authors:

Yildiz, Beytullah [1]

Wu, Kesheng [1]

Byna, Suren [1]

Shoshani, Arie [1]

Affiliation:

[1] Lawrence Berkeley Natl Lab, Computat Res Div, Mail Stop 50B-3238,1 Cyclotron Rd, Berkeley, CA 94720 USA

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2019, Vol. 31, Issue 15

Keywords:

big data; bitmap index; data management; membership query; parallel query; scientific data

DOI:

10.1002/cpe.5157

Chinese Library Classification number:

TP31 [Computer software]

Discipline classification codes:

081202; 0835

Abstract:

Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging as data are often stored in specific formatted files, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, to identify whether queried elements of a set are members of an attribute. Attributes that naturally have classification values appear frequently in scientific domains such as category and object type as well as in daily life such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low cardinality attributes but also for high cardinality attributes, such as floating-point variables, electric charge, or momentum in a particle physics data set, due to compression algorithms such as Word-Aligned Hybrid. We conducted experiments, in a highly parallelized environment, on data obtained from a particle accelerator model and a synthetic data set.
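
As a minimal sketch of the idea described in the abstract (not the authors' implementation), a membership query over a bitmap index can be answered by OR-ing together the bitmaps of the queried values. The function names `build_bitmap_index` and `membership_query` and the `particle_type` sample data are hypothetical, chosen only for illustration. The bitmaps here are plain Python integers, so the sketch omits both the Word-Aligned Hybrid compression and the parallelization that the paper relies on.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value.

    Hypothetical helper: an uncompressed bitmap index, one bitmap per distinct value.
    """
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def membership_query(index, wanted):
    """Return the row numbers whose attribute value is a member of `wanted`."""
    hits = 0
    for value in wanted:
        # Values absent from the index contribute an empty bitmap.
        hits |= index.get(value, 0)

    # Decode the combined bitmap back into row numbers, lowest bit first.
    rows = []
    row = 0
    while hits:
        if hits & 1:
            rows.append(row)
        hits >>= 1
        row += 1
    return rows


# Illustrative classification attribute (made-up data, not from the paper).
particle_type = ["electron", "muon", "photon", "electron", "pion", "muon"]
index = build_bitmap_index(particle_type)
matches = membership_query(index, {"electron", "pion"})
print(matches)
```

The query cost in this sketch depends on the number of queried values and the bitmap length rather than on scanning the raw attribute, which is the property that makes bitmap indexes attractive for this kind of query.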

Pages: 15