PI-Join: Efficiently processing join queries on massive data

Cited by: 7
Authors
Han, Xixian [2]
Li, Jianzhong [2, 3]
Yang, Donghua [1]
Affiliations
[1] Harbin Inst Technol, Acad Fundamental & Interdisciplinary Sci, Harbin, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China
[3] Harbin Inst Technol, Dept Comp Sci & Engn, Harbin, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Massive data; PI-join; JPIPT construction stage; Result output stage; INDEX STRUCTURE; PERFORMANCE;
DOI
10.1007/s10115-011-0429-x
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
The ratio of disk capacity to disk transfer rate typically increases by 10x per decade. As a result, disk is becoming slower from the point of view of applications, because the data volume they need to store and process is much larger. In database systems, the smaller the data volume involved in query processing, the better the performance. The disk-based join is a common but time-consuming database operation, especially on massive data, where I/O cost dominates the execution time. However, current join algorithms are suitable only for small or moderate data volumes. They incur high I/O cost on massive data because they make multiple I/O passes over the joined tables and are insensitive to join selectivity. This paper proposes PI-Join, a novel disk-based join algorithm that can efficiently process join queries involving massive data. PI-Join consists of two stages: the JPIPT construction stage (JCS) and the result output stage (ROS). JCS runs a cache-conscious construction algorithm on the join attributes, which are kept in a column-oriented model, to obtain the join positional index pair table (JPIPT) of the join results faster. The obtained JPIPT is used in ROS to retrieve the results in a one-pass sequential selective scan on each table. We provide the correctness proof and cost analysis of PI-Join. Our experimental results indicate that PI-Join has a significant advantage over the existing join algorithms.
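The two-stage idea in the abstract can be illustrated with a small in-memory sketch. This is not the authors' algorithm: the function names (`build_jpipt`, `selective_scan`, `pi_join_sketch`) are hypothetical, a plain hash lookup stands in for the cache-conscious construction step, and Python lists stand in for disk-resident tables and columns. It only shows the shape of the approach: first compute pairs of matching row positions from the join columns alone, then fetch the needed rows in one forward pass per table.

```python
from collections import defaultdict


def build_jpipt(col_r, col_s):
    """Stage 1 (stand-in for JCS): pair up row positions whose join values match.

    Assumption: a simple hash lookup replaces the paper's cache-conscious
    construction algorithm, which the abstract does not describe in detail.
    """
    positions = defaultdict(list)
    for pos_r, value in enumerate(col_r):
        positions[value].append(pos_r)
    pairs = []
    for pos_s, value in enumerate(col_s):
        for pos_r in positions.get(value, ()):
            pairs.append((pos_r, pos_s))
    pairs.sort()  # order by position in R, then by position in S
    return pairs


def selective_scan(table, wanted_positions):
    """Stage 2 helper (stand-in for ROS): one forward pass over `table`,
    keeping only the rows at the wanted positions."""
    wanted = sorted(set(wanted_positions))
    fetched = {}
    i = 0
    for pos, row in enumerate(table):
        if i == len(wanted):
            break  # every needed row has been read; stop scanning
        if pos == wanted[i]:
            fetched[pos] = row
            i += 1
    return fetched


def pi_join_sketch(table_r, table_s, key_r, key_s):
    """Join two lists of tuples on the given column indexes."""
    # Column-oriented view of the join attributes only.
    col_r = [row[key_r] for row in table_r]
    col_s = [row[key_s] for row in table_s]
    jpipt = build_jpipt(col_r, col_s)
    rows_r = selective_scan(table_r, (p for p, _ in jpipt))
    rows_s = selective_scan(table_s, (q for _, q in jpipt))
    return [rows_r[p] + rows_s[q] for p, q in jpipt]


if __name__ == "__main__":
    r = [(1, "a"), (2, "b"), (3, "c")]
    s = [(3, "x"), (1, "y"), (1, "z"), (4, "w")]
    for joined in pi_join_sketch(r, s, 0, 0):
        print(joined)
```

In this sketch the row with key 2 in the first table and the row with key 4 in the second are never fetched, which is the sense in which the scan is selective: rows that contribute to no result are skipped. The sketch holds the fetched rows in a dictionary for simplicity; how the actual algorithm buffers and emits results on disk-resident data is not stated in the abstract.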
Pages: 527-557
Page count: 31
Related papers
50 in total
  • [21] DHTJoin: processing continuous join queries using DHT networks
    Palma, Wenceslao
    Akbarinia, Reza
    Pacitti, Esther
    Valduriez, Patrick
    [J]. DISTRIBUTED AND PARALLEL DATABASES, 2009, 26 (2-3) : 291 - 317
  • [22] DHTJoin: processing continuous join queries using DHT networks
    Wenceslao Palma
    Reza Akbarinia
    Esther Pacitti
    Patrick Valduriez
    [J]. Distributed and Parallel Databases, 2009, 26
  • [23] Integrity for Join Queries in the Cloud
    di Vimercati, Sabrina De Capitani
    Foresti, Sara
    Jajodia, Sushil
    Paraboschi, Stefano
    Samarati, Pierangela
    [J]. IEEE TRANSACTIONS ON CLOUD COMPUTING, 2013, 1 (02) : 187 - 200
  • [24] Multiple join processing in data grid
    Yang, DH
    Rasool, Q
    Zhang, ZH
    [J]. FRONTIERS OF WWW RESEARCH AND DEVELOPMENT - APWEB 2006, PROCEEDINGS, 2006, 3841 : 793 - 799
  • [25] Map-Side Join Processing of SPARQL Queries Based on Abstract RDF Data Filtering
    Song, Minjae
    Oh, Hyunsuk
    Seo, Seungmin
    Lee, Kyong-Ho
    [J]. JOURNAL OF DATABASE MANAGEMENT, 2019, 30 (01) : 22 - 40
  • [26] Transformation of continuous aggregation join queries over data streams
    Tran, Tri Minh
    Lee, Byung Suk
    [J]. ADVANCES IN SPATIAL AND TEMPORAL DATABASES, PROCEEDINGS, 2007, 4605 : 330 - +
  • [27] Classic distance join queries using compact data structures
    de Bernardo, Guillermo
    Penabad, Miguel R.
    Corral, Antonio
    Brisaboa, Nieves R.
    [J]. Information Sciences, 2024, 674
  • [28] Efficient processing of continuous join queries using distributed hash tables
    Palma, Wenceslao
    Akbarinia, Reza
    Pacitti, Esther
    Valduriez, Patrick
    [J]. EURO-PAR 2008 PARALLEL PROCESSING, PROCEEDINGS, 2008, 5168 : 632 - 641
  • [29] Parallel processing of "Group-By Join" queries on shared nothing machines
    Hassan, M. Al Hajj
    Bamha, M.
    [J]. SOFTWARE AND DATA TECHNOLOGIES, 2008, 10 : 230 - 241
  • [30] Processing Strategy for Global XQuery Queries Based on XQuery Join Cost
    Park, Jong-Hyun
    Kang, Ji-Hoon
    [J]. JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2010, 26 (02) : 659 - 672