An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Authors
Xi-Te Wang
De-Rong Shen
Mei Bai
Tie-Zheng Nie
Yue Kou
Ge Yu
Affiliation
[1] Northeastern University, College of Information Science and Engineering
Keywords
outlier detection; multi-dimensional; distributed; large dataset
DOI
Not available
Abstract
Distance-based outliers are a widely used definition of outliers: a point is identified as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computation in distributed environments, we propose DBOZ, a distributed algorithm for distance-based outlier detection using a Z-curve hierarchical tree (ZH-tree). First, we propose a new index, the ZH-tree, to effectively manage data in a distributed environment. The ZH-tree has two desirable properties: a clustering property that helps search for the neighbors of a point, and a hierarchical structure that supports space pruning. We also design a bottom-up approach to build the ZH-tree in parallel, whose time complexity is linear in the number of dimensions and the size of the dataset. Second, we propose DBOZ to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all points, we design a greedy method and a new ZH-tree-based k-nearest neighbor search algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW and then outputs the final outliers by refining the remaining points. Finally, the efficiency and effectiveness of the ZH-tree and DBOZ are verified through a series of experiments.
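The two-stage idea in the abstract can be illustrated with a small single-machine sketch. It is not the authors' DBOZ or ZH-tree: it assumes the outlier score of a point is the distance to its k-th nearest neighbor and that the top n points by score are wanted, and it replaces the ZH-tree and ZHkNN with a plain list sorted by Z-curve value. All function names and the `bits` and `window` parameters are hypothetical. The pruning relies on one fact: the k-th neighbor distance measured inside any subset of the data can only overestimate the true one, so it is a safe upper bound.

```python
import heapq
import math


def z_value(cell, bits):
    """Interleave the low `bits` bits of integer grid coordinates (Z-curve code)."""
    code = 0
    dims = len(cell)
    for bit in range(bits):
        for d, coord in enumerate(cell):
            code |= ((coord >> bit) & 1) << (bit * dims + d)
    return code


def z_order(points, bits=8):
    """Return point indices sorted by the Z-value of their quantized grid cell."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    top = (1 << bits) - 1

    def cell(p):
        # Map each coordinate into 0..top; a zero-width range maps to cell 0.
        return [int((p[d] - lo[d]) / ((hi[d] - lo[d]) or 1.0) * top)
                for d in range(dims)]

    return sorted(range(len(points)), key=lambda i: z_value(cell(points[i]), bits))


def kth_nn_distance(points, i, candidates, k):
    """Distance from points[i] to its k-th nearest neighbor among `candidates`."""
    dists = [math.dist(points[i], points[j]) for j in candidates if j != i]
    if len(dists) < k:
        return math.inf  # not enough candidates: no finite bound is known
    return heapq.nsmallest(k, dists)[-1]


def top_n_outliers(points, k, n, bits=8, window=8):
    """Filter-and-refine top-n outlier search (assumed score: k-th NN distance)."""
    everyone = range(len(points))
    order = z_order(points, bits)

    # Cheap upper bound on each score, using only neighbors along the Z-curve.
    upper = [0.0] * len(points)
    for pos, i in enumerate(order):
        nearby = order[max(0, pos - window): pos + window + 1]
        upper[i] = kth_nn_distance(points, i, nearby, k)

    # Stage 1 (greedy): exact scores of the n most promising points give a
    # threshold (called LW in the abstract) that the true top-n scores must reach.
    seeds = heapq.nlargest(n, everyone, key=upper.__getitem__)
    exact = {i: kth_nn_distance(points, i, everyone, k) for i in seeds}
    lw = min(exact.values())

    # Stage 2 (filter): a point whose upper bound is below LW cannot be in the top n.
    survivors = [i for i in everyone if upper[i] >= lw]

    # Stage 2 (refine): compute exact scores only for the surviving points.
    for i in survivors:
        if i not in exact:
            exact[i] = kth_nn_distance(points, i, everyone, k)
    return heapq.nlargest(n, survivors, key=exact.__getitem__)
```

Because the filter only discards points whose upper bound falls strictly below the threshold, the result matches a brute-force scan over all points (up to ties in the score); the Z-curve ordering affects only how many points survive the filter, not correctness.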
Pages: 1233 - 1248
Page count: 15
Related papers
  • [1] An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets
    Wang, Xi-Te
    Shen, De-Rong
    Bai, Mei
    Nie, Tie-Zheng
    Kou, Yue
    Yu, Ge
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2015, 30 (06) : 1233 - 1248
  • [2] Density-based Outlier Detection in Multi-dimensional Datasets
    Wang, Xite
    Cao, Zhixin
    Zhan, Rongjuan
    Bai, Mei
    Ma, Qian
    Li, Guanyu
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2022, 16 (12) : 3815 - 3835
  • [3] Visualization of large multi-dimensional datasets
    Welling, J
    Derthick, M
    VIRTUAL OBSERVATORIES OF THE FUTURE, PROCEEDINGS, 2001, 225 : 284 - 290
  • [4] Cell-based outlier detection algorithm: A fast outlier detection algorithm for large datasets
    Wan, You
    Bian, Fuling
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2008, 5012 : 1042 - 1048
  • [5] Outlier Detection for Robust Multi-Dimensional Scaling
    Blouvshtein, Leonid
    Cohen-Or, Daniel
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (09) : 2273 - 2279
  • [6] BOD: An efficient algorithm for distributed outlier detection
    Wang X.-T.
    Shen D.-R.
    Bai M.
    Nie T.-Z.
    Kou Y.
    Yu G.
    1600, Science Press (39) : 36 - 51
  • [7] A distributed algorithm for outlier detection in a large database
    Sarker, BK
    Kitagawa, H
    DATABASES IN NETWORKED INFORMATION SYSTEMS, PROCEEDINGS, 2005, 3433 : 300 - 309
  • [8] An Improved KNN Based Outlier Detection Algorithm for Large Datasets
    Wang, Qian
    Zheng, Min
    ADVANCED DATA MINING AND APPLICATIONS, ADMA 2010, PT I, 2010, 6440 : 585 - 592
  • [9] Outlier detection based on multi-dimensional clustering and local density
    Shou Zhao-yu
    Li Meng-ya
    Li Si-min
    JOURNAL OF CENTRAL SOUTH UNIVERSITY, 2017, 24 (06) : 1299 - 1306