Outlier mining in large high-dimensional data sets

Cited by: 228
Authors
Angiulli, F [1]
Pizzuti, C [1]
Affiliations
[1] Italian Natl Res Council, Inst High Performance Comp & Networking, CNR, ICAR, I-87036 Arcavacata Di Rende, CS, Italy
Keywords
outlier mining; space-filling curves
DOI
10.1109/TKDE.2005.31
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper proposes a new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large, high-dimensional data set. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are the points with the largest weights. HilOut uses a space-filling curve to linearize the data set and consists of two phases. The first phase provides an approximate solution, within a rough factor, after at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points. During this phase, the algorithm isolates candidate outliers and reduces this candidate set at each iteration. If the size of the set becomes n, the algorithm stops and reports the exact solution. The second phase computes the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops during the first phase, reporting the exact solution, after far fewer than d + 1 steps. We present both an in-memory and a disk-based implementation of HilOut, together with a thorough scaling analysis on real and synthetic data sets showing that the algorithm scales well in both cases.
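The weight definition in the abstract can be illustrated with a minimal brute-force sketch. This is not HilOut itself: it compares every pair of points and uses no space-filling curve, so it only shows what is being computed, not how the paper computes it efficiently. The function names `knn_weight` and `top_n_outliers`, the Euclidean distance, and the sample points are assumptions made for illustration.

```python
import math


def knn_weight(points, i, k):
    """Sum of the distances from points[i] to its k nearest neighbors.

    Brute force: measures the distance to every other point (assumed Euclidean).
    """
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return sum(dists[:k])


def top_n_outliers(points, n, k):
    """Indices of the n points with the largest weight, largest first."""
    weights = [knn_weight(points, i, k) for i in range(len(points))]
    order = sorted(range(len(points)), key=lambda i: weights[i], reverse=True)
    return order[:n]


# Hypothetical sample data: four points on a unit square plus one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(top_n_outliers(points, n=1, k=2))  # → [4]
```

With k = 2, each corner of the unit square has weight 2, while the distant point has a weight of roughly 26, so it is the single top outlier. The cost of this sketch is quadratic in the number of points, which is the cost the approach described in the abstract is designed to avoid.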
Pages: 203-215
Number of pages: 13