An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Cited by: 11
Authors
Wang, Xi-Te [1]
Shen, De-Rong [1]
Bai, Mei [1]
Nie, Tie-Zheng [1]
Kou, Yue [1]
Yu, Ge [1]
Affiliations
[1] Northeastern University, College of Information Science and Engineering, Shenyang 110819, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
outlier detection; multi-dimensional; distributed; large dataset; MINING OUTLIERS;
DOI
10.1007/s11390-015-1596-0
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Distance-based outliers are a widely used outlier definition: a point is identified as an outlier on the basis of the distances to its nearest neighbors. To detect such outliers in distributed environments, this paper proposes DBOZ, a distributed algorithm for distance-based outlier detection that uses a Z-curve hierarchical tree (ZH-tree). First, we propose a new index, the ZH-tree, to manage data effectively in a distributed environment. The ZH-tree has two advantages: a clustering property that helps locate the neighbors of a point, and a hierarchical structure that supports space pruning. We also design a bottom-up approach that builds the ZH-tree in parallel, with time complexity linear in both the number of dimensions and the size of the dataset. Second, we propose DBOZ to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of every point, we design a greedy method and a new ZH-tree-based k-nearest neighbor search algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first uses LW to filter out unpromising points and then refines the remaining points to output the final outliers. Finally, the efficiency and effectiveness of the ZH-tree and DBOZ are verified through a series of experiments.
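
The filter-and-refine idea in the abstract can be illustrated with a small single-machine sketch. This is not the paper's DBOZ, ZH-tree, or ZHkNN: a plain list sorted by Z-curve (Morton) value stands in for the index, the outlier score is assumed to be the distance to the k-th nearest neighbor, and the goal is assumed to be the top-n points by that score. The names `z_value`, `exact_score`, `top_n_outliers`, and `lw` are hypothetical. The pruning step relies on two facts: a k-th neighbor distance computed over a subset of the data can only overestimate the true score, and the smallest exact score among any n points is a lower bound on the final n-th largest score.

```python
import heapq
import math


def z_value(point, bits=8):
    """Morton code: interleave the low `bits` bits of non-negative integer coordinates."""
    dims = len(point)
    z = 0
    for bit in range(bits):
        for d, coord in enumerate(point):
            z |= ((coord >> bit) & 1) << (bit * dims + d)
    return z


def kth_smallest_distance(p, others, k):
    """Distance from p to the k-th closest point in `others` (needs len(others) >= k)."""
    return heapq.nsmallest(k, (math.dist(p, q) for q in others))[-1]


def exact_score(points, i, k):
    """Assumed outlier score: distance from points[i] to its k-th nearest neighbor."""
    others = [q for j, q in enumerate(points) if j != i]
    return kth_smallest_distance(points[i], others, k)


def top_n_outliers(points, k, n, bits=8):
    """Return the n points with the largest scores as (index, score) pairs."""
    assert len(points) > k and 1 <= n <= len(points)
    # Sort indices along the Z-curve so that nearby points tend to be adjacent.
    order = sorted(range(len(points)), key=lambda i: z_value(points[i], bits))

    # Cheap upper bound on each score: k-th distance among Z-order neighbors only.
    upper = {}
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - k), min(len(order), pos + k + 1)
        near = [points[order[j]] for j in range(lo, hi) if j != pos]
        upper[i] = kth_smallest_distance(points[i], near, k)

    # Greedy threshold: exact scores of the n candidates with the largest upper bounds.
    candidates = heapq.nlargest(n, upper, key=upper.get)
    exact = {i: exact_score(points, i, k) for i in candidates}
    lw = min(exact.values())  # lower bound on the n-th largest true score

    # Filter: a point whose upper bound is below lw cannot be in the top n.
    # Refine: compute exact scores only for the points that survive the filter.
    for i in upper:
        if i not in exact and upper[i] >= lw:
            exact[i] = exact_score(points, i, k)

    return sorted(exact.items(), key=lambda item: (-item[1], item[0]))[:n]
```

For example, calling `top_n_outliers(points, k=3, n=2)` on a list of integer-coordinate tuples returns the two highest-scoring points with their scores. Coordinates must be non-negative integers smaller than `2**bits`; real-valued data would need to be scaled and quantized first, which this sketch leaves out.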
Pages: 1233-1248
Page count: 16
Source: Journal of Computer Science and Technology, 2015, 30: 1233-1248