Outlier mining in large high-dimensional data sets

被引:228
|
作者
Angiulli, F [1 ]
Pizzuti, C [1 ]
机构
[1] Italian Natl Res Council, Inst High Performance Comp & Networking, CNR, ICAR, I-87036 Arcavacata Di Rende, CS, Italy
关键词
outlier mining; space-filling curves;
D O I
10.1109/TKDE.2005.31
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In this paper, a new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large and high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest-neighbors. Outlier are those points scoring the largest values of weight. The algorithm HilOut makes use of the notion of space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d+1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates points candidate to be outliers and reduces this set at each iteration. If the size of this set becomes n, then the algorithm stops reporting the exact solution. The second phase calculates the exact solution with a final scan examining further the candidate outliers that remained after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after much less than d + 1 steps. We present both an in-memory and disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets showing that the algorithm scales well in both cases.
引用
收藏
页码:203 / 215
页数:13
相关论文
共 50 条
  • [1] On eigenfunction approach to data mining: outlier detection in high-dimensional data sets
    Nagar, AK
    Muyeba, MK
    [J]. 8TH WORLD MULTI-CONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL II, PROCEEDINGS: COMPUTING TECHNIQUES, 2004, : 251 - 256
  • [2] Ordinal Outlier Algorithm for Anomaly Detection of High-Dimensional Data Sets
    Chen, Gang
    Du, Linlin
    An, Baoran
    [J]. PROCEEDINGS OF THE 32ND 2020 CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2020), 2020, : 5356 - 5361
  • [3] Outlier detection for high-dimensional data
    Ro, Kwangil
    Zou, Changliang
    Wang, Zhaojun
    Yin, Guosheng
    [J]. BIOMETRIKA, 2015, 102 (03) : 589 - 599
  • [4] Outlier mining based on Variance of Angle technology research in High-Dimensional Data
    Liu, Wenting
    Pan, Ruikai
    [J]. 2015 10TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND KNOWLEDGE ENGINEERING (ISKE), 2015, : 598 - 603
  • [5] Intrinsic dimensional outlier detection in high-dimensional data
    Von Brünken, Jonathan
    Houle, Michael E.
    Zimek, Arthur
    [J]. NII Technical Reports, 2015, (03): : 1 - 12
  • [6] GAUSSIAN PROCESSES FOR HIGH-DIMENSIONAL, LARGE DATA SETS: A REVIEW
    Jiang, Mengrui
    Pedrielli, Giulia
    Szu Hui Ng
    [J]. 2022 WINTER SIMULATION CONFERENCE (WSC), 2022, : 49 - 60
  • [7] Efficient Outlier Detection for High-Dimensional Data
    Liu, Huawen
    Li, Xuelong
    Li, Jiuyong
    Zhang, Shichao
    [J]. IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2018, 48 (12): : 2451 - 2461
  • [8] A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes
    Koufakou, Anna
    Georgiopoulos, Michael
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2010, 20 (02) : 259 - 289
  • [9] A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes
    Anna Koufakou
    Michael Georgiopoulos
    [J]. Data Mining and Knowledge Discovery, 2010, 20 : 259 - 289
  • [10] Mining High-Dimensional CyTOF Data: Concurrent Gating, Outlier Removal, and Dimension Reduction
    Lee, Sharon X.
    [J]. DATABASES THEORY AND APPLICATIONS, ADC 2017, 2017, 10538 : 178 - 189