PI-Join: Efficiently processing join queries on massive data

Cited by: 7

Authors:

Han, Xixian [2]

Li, Jianzhong [2,3]

Yang, Donghua [1]

Affiliations:

[1] Harbin Inst Technol, Acad Fundamental & Interdisciplinary Sci, Harbin, Peoples R China

[2] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China

[3] Harbin Inst Technol, Dept Comp Sci & Engn, Harbin, Peoples R China

Source:

KNOWLEDGE AND INFORMATION SYSTEMS | 2012 / Vol. 32 / Issue 03

Funding:

National Natural Science Foundation of China;

Keywords:

Massive data; PI-join; JPIPT construction stage; Result output stage; INDEX STRUCTURE; PERFORMANCE;

DOI:

10.1007/s10115-011-0429-x

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The ratio of disk capacity to disk transfer rate typically increases by 10x per decade. As a result, disk is becoming slower from the view of applications because of the much larger data volume that they need to store and process. In database systems, the smaller the data volume involved in query processing, the better the performance achieved. The disk-based join is a common but time-consuming database operation, especially in an environment of massive data in which I/O cost dominates the execution time. However, current join algorithms are only suitable for moderate or small data volumes. They incur high I/O cost when run on massive data because of multi-pass I/O operations on the joined tables and their insensitivity to join selectivity. This paper proposes PI-Join, a novel disk-based join algorithm that can efficiently process join queries involving massive data. PI-Join consists of two stages: the JPIPT construction stage (JCS) and the result output stage (ROS). JCS performs a cache-conscious construction algorithm on the join attributes, which are kept in a column-oriented model, to obtain the join positional index pair table (JPIPT) of the join results faster. The obtained JPIPT is used in ROS to retrieve results in a one-pass sequential selective scan on each table. We provide the correctness proof and cost analysis of PI-Join. Our experimental results indicate that PI-Join has a significant advantage over the existing join algorithms.
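
The two-stage structure described in the abstract can be illustrated with a minimal in-memory sketch. It is not the paper's algorithm: the abstract does not specify the cache-conscious construction procedure, so a plain hash lookup is swapped in for it, and Python lists stand in for disk-resident tables and column files. All function names (`build_jpipt`, `selective_scan`, `output_results`) are hypothetical.

```python
from collections import defaultdict


def build_jpipt(r_keys, s_keys):
    """Stage 1 (stand-in for JCS): read only the join-attribute columns and
    return (position in R, position in S) pairs for matching keys.
    Assumption: a simple hash lookup replaces the paper's cache-conscious
    construction algorithm, which the abstract does not describe."""
    positions_in_s = defaultdict(list)
    for s_pos, key in enumerate(s_keys):
        positions_in_s[key].append(s_pos)
    return [(r_pos, s_pos)
            for r_pos, key in enumerate(r_keys)
            for s_pos in positions_in_s.get(key, ())]


def selective_scan(table, wanted_positions):
    """One sequential pass over a table, keeping only the rows whose
    positions appear in the positional index."""
    wanted = set(wanted_positions)
    return {pos: row for pos, row in enumerate(table) if pos in wanted}


def output_results(r_table, s_table, jpipt):
    """Stage 2 (stand-in for ROS): scan each table once, selectively,
    then pair up the retrieved rows according to the positional pairs."""
    r_rows = selective_scan(r_table, (r_pos for r_pos, _ in jpipt))
    s_rows = selective_scan(s_table, (s_pos for _, s_pos in jpipt))
    return [r_rows[r_pos] + s_rows[s_pos] for r_pos, s_pos in jpipt]


# Toy tables; the first field of each row is the join attribute.
r_table = [(1, "a"), (2, "b"), (3, "c")]
s_table = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]

# Column-oriented view of the join attributes only.
r_keys = [row[0] for row in r_table]
s_keys = [row[0] for row in s_table]

jpipt = build_jpipt(r_keys, s_keys)
results = output_results(r_table, s_table, jpipt)
```

The point of the split is that stage 1 touches only the narrow join columns, and stage 2 reads each full table a single time, skipping rows that do not contribute to the result. The sketch holds the selected rows in memory, which is a simplification of whatever buffering the actual algorithm uses.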


Pages: 527-557

Page count: 31
