Parallel similarity joins on massive high-dimensional data using MapReduce

Cited by: 24
Authors
Ma, Youzhong [1 ,2 ]
Meng, Xiaofeng [2 ]
Wang, Shaoya [3 ]
Affiliations
[1] Luoyang Normal Univ, Sch Informat & Technol, Luoyang 47102, Peoples R China
[2] Renmin Univ China, Sch Informat, Beijing 100872, Peoples R China
[3] NEC Labs China, Beijing, Peoples R China
Source
Keywords
similarity join; MapReduce; symbolic aggregate approximation; high-dimensional data; piecewise aggregate approximation;
DOI
10.1002/cpe.3663
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
In this paper, we focus on high-dimensional similarity join (HDSJ) using the MapReduce paradigm. As the volume of data and the number of dimensions increase, the computation cost of HDSJ increases exponentially. No existing approach processes HDSJ efficiently, so we propose a novel method, symbolic aggregate approximation (SAX)-based HDSJ, to address the problem. SAX is a dimensionality reduction technique that is widely used in time series processing. We use SAX to represent the high-dimensional vectors and reorganize these vectors into groups based on their SAX representations. For very high-dimensional vectors, we also propose an improved SAX-based HDSJ approach. Finally, we implement SAX-based HDSJ and improved SAX-based HDSJ on Hadoop-0.20.2, perform comprehensive experiments to test their performance, and compare both approaches with the existing method. The experimental results show that our proposed approaches perform much better than the existing method. Copyright (c) 2015 John Wiley & Sons, Ltd.
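The abstract names SAX and grouping by SAX representation but gives no implementation details, so the following is only a minimal sketch of the general technique, not the authors' method. It assumes the vector components are already scaled to roughly standard-normal values (normalization is skipped), that the vector length is divisible by the number of segments, and that the function names `paa`, `breakpoints`, `sax_word`, and `group_by_sax` are hypothetical. Each vector is reduced with piecewise aggregate approximation (PAA), each segment mean is mapped to a letter using equiprobable Gaussian breakpoints, and vectors are then bucketed by the resulting word.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import NormalDist


def paa(vector, segments):
    """Piecewise aggregate approximation: mean of each equal-length segment."""
    n = len(vector)
    if n % segments != 0:
        # Assumption of this sketch, not a requirement stated in the abstract.
        raise ValueError("len(vector) must be divisible by segments")
    width = n // segments
    return [sum(vector[i * width:(i + 1) * width]) / width
            for i in range(segments)]


def breakpoints(alphabet_size):
    """Cut points splitting a standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax_word(vector, segments, alphabet_size):
    """Map each PAA segment mean to a letter ('a' is the lowest region)."""
    cuts = breakpoints(alphabet_size)
    return "".join(chr(ord("a") + bisect_right(cuts, mean))
                   for mean in paa(vector, segments))


def group_by_sax(vectors, segments, alphabet_size):
    """Bucket vector indices by SAX word (hypothetical grouping key)."""
    groups = defaultdict(list)
    for idx, vec in enumerate(vectors):
        groups[sax_word(vec, segments, alphabet_size)].append(idx)
    return dict(groups)


vectors = [
    [-1.0, -1.0, 1.0, 1.0],
    [-0.9, -1.1, 1.2, 0.8],
    [1.0, 1.0, -1.0, -1.0],
]
groups = group_by_sax(vectors, segments=2, alphabet_size=4)
print(groups)
```

In this toy run the first two vectors share a SAX word and land in the same group, while the third falls into a different one. In a MapReduce setting a key like this could plausibly serve as the shuffle key so that candidate pairs meet in the same reducer, but how the paper actually assigns groups and verifies candidate pairs is not stated in the abstract.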
Pages: 166 - 183
Page count: 18
Related papers
50 in total
  • [21] High-dimensional similarity retrieval using dimensional choice
    Tahmoush, Dave
    Samet, Hanan
    [J]. SISAP 2008: FIRST INTERNATIONAL WORKSHOP ON SIMILARITY SEARCH AND APPLICATIONS, PROCEEDINGS, 2008, : 35 - 42
  • [22] Analysis of high-dimensional genomic data using MapReduce based probabilistic neural network
    Baliarsingh, Santos Kumar
    Vipsita, Swati
    Gandomi, Amir H.
    Panda, Abhijeet
    Bakshi, Sambit
    Ramasubbareddy, Somula
    [J]. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2020, 195
  • [23] Strategic and suave processing for performing similarity joins using MapReduce
    Mahalakshmi Lakshminarayanan
    William F. Acosta
    Robert C. Green
    Vijay Devabhaktuni
    [J]. The Journal of Supercomputing, 2014, 69 : 930 - 954
  • [24] Strategic and suave processing for performing similarity joins using MapReduce
    Lakshminarayanan, Mahalakshmi
    Acosta, William F.
    Green, Robert C., II
    Devabhaktuni, Vijay
    [J]. JOURNAL OF SUPERCOMPUTING, 2014, 69 (02): : 930 - 954
  • [25] Parallel labeling of massive XML data with MapReduce
    Choi, Hyebong
    Lee, Kyong-Ha
    Lee, Yoon-Joon
    [J]. JOURNAL OF SUPERCOMPUTING, 2014, 67 (02): : 408 - 437
  • [26] Parallel labeling of massive XML data with MapReduce
    Hyebong Choi
    Kyong-Ha Lee
    Yoon-Joon Lee
    [J]. The Journal of Supercomputing, 2014, 67 : 408 - 437
  • [27] Parallel coordinate order for high-dimensional data
    Tilouche, Shaima
    Partovi Nia, Vahid
    Bassetto, Samuel
    [J]. STATISTICAL ANALYSIS AND DATA MINING, 2021, 14 (05) : 501 - 515
  • [28] Parallel Processing of Massive EEG Data with MapReduce
    Wang, Lizhe
    Chen, Dan
    Ranjan, Rajiv
    Khan, Samee U.
    Kolodziej, Joanna
    Wang, Jun
    [J]. PROCEEDINGS OF THE 2012 IEEE 18TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2012), 2012, : 164 - 171
  • [29] High-Dimensional Similarity Query Processing for Data Science
    Qin, Jianbin
    Wang, Wei
    Xiao, Chuan
    Zhang, Ying
    Wang, Yaoshu
    [J]. KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 4062 - 4063
  • [30] Parallel Map Matching on Massive Vehicle GPS Data Using MapReduce
    Huang, Jian
    Qiao, Shaoqing
    Yu, Haitao
    Qie, Jinhui
    Liu, Chunwei
    [J]. 2013 IEEE 15TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS & 2013 IEEE INTERNATIONAL CONFERENCE ON EMBEDDED AND UBIQUITOUS COMPUTING (HPCC_EUC), 2013, : 1498 - 1503