RHJoin: A Fast and Space-efficient Join Method for Log Processing in MapReduce

Cited by: 0
Authors
Tang, Dixin [1]
Liu, Taoying [1]
Liu, Hong [1]
Li, Wei [1]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
Keywords
MapReduce; Join; Log Processing; Big data; MAP-REDUCE; SYSTEM;
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Equi-join is heavily used in MapReduce-based log processing. With the rapid growth of dataset sizes, join methods on MapReduce have been extensively studied in recent years. We find that existing join methods usually cannot achieve high query performance and affordable storage consumption at the same time when faced with a huge amount of log data. They either optimize one aspect while significantly sacrificing the other, or have limited applicability. In this paper, after analyzing the characteristics of the workloads and of the underlying MapReduce framework, we present RHJoin (Repartition Hash Join), a join method with specific optimizations for log processing, and its implementation on Hadoop. In RHJoin, reference tables are partitioned in a pre-processing step, the log table is partitioned on the map side, and the hash join is executed on the reduce side. The shuffle procedure of MapReduce is also optimized by removing the sort step and overlapping the execution of mappers and reducers. Comprehensive experiments show that RHJoin achieves high query performance with only a small extra storage cost and is applicable to a wide range of log-processing scenarios.
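The three steps named in the abstract (partition the reference table ahead of time, partition the log table with the same function, then hash-join each pair of partitions) can be illustrated with a minimal single-process sketch. This is not the authors' Hadoop implementation: the function names (`partition_of`, `partition_table`, `hash_join_partition`, `rh_join`), the CRC32-based partition function, and the dictionary row format are all assumptions made for illustration, and the shuffle optimizations (overlapping mappers and reducers) are not modeled.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Hypothetical partition function: a deterministic hash, so both
    # tables send the same join key to the same partition.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_table(rows, key_field, num_partitions):
    # Split rows into num_partitions buckets by join key.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row[key_field], num_partitions)].append(row)
    return parts


def hash_join_partition(ref_rows, log_rows, key_field):
    # Build a hash table on the reference partition, then probe it
    # with each log row. No sorting of either input is required.
    table = defaultdict(list)
    for ref in ref_rows:
        table[ref[key_field]].append(ref)
    joined = []
    for log in log_rows:
        for ref in table.get(log[key_field], []):
            joined.append({**ref, **log})
    return joined


def rh_join(reference, logs, key_field, num_partitions=4):
    # Stands in for the pre-processing step (reference table).
    ref_parts = partition_table(reference, key_field, num_partitions)
    # Stands in for map-side partitioning of the log table.
    log_parts = partition_table(logs, key_field, num_partitions)
    # Stands in for the reduce side: one hash join per partition pair.
    result = []
    for ref_rows, log_rows in zip(ref_parts, log_parts):
        result.extend(hash_join_partition(ref_rows, log_rows, key_field))
    return result
```

For example, joining a two-row reference table of users with three log records on a hypothetical `user_id` field returns only the log rows whose key appears in the reference table, each merged with its matching reference row.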
Pages: 975-980
Page count: 6