Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Cited by: 0
Authors
Dalibor Krleža
Boris Vrdoljak
Mario Brčić
Affiliations
[1] University of Zagreb,Faculty of Electrical Engineering and Computing
Source
Machine Learning | 2021 / Vol. 110
Keywords
Big data; Clustering; Anomaly detection; Fraud detection
DOI
Not available
Abstract
Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse data streams is impossible due to processing power and memory constraints. To solve this, the complexity of traditional clustering algorithms had to be reduced, which led to the creation of sequential clustering algorithms. The usual approach is two-phase clustering, which uses an online phase to reduce data detail and complexity, and an offline phase to cluster the concepts created in the online phase. Detecting anomalies in a data stream is usually done in the online phase, as it requires unreduced data. In contrast, producing good macro-clustering is done in the offline phase, which is why two-phase clustering algorithms have difficulty being equally good at anomaly detection and macro-clustering. In this paper, we propose a statistical hierarchical clustering algorithm equally suitable for both detecting anomalies and macro-clustering. The proposed algorithm is single-phased and uses statistical inference on the input data stream, resulting in statistical distributions that are constantly updated. This makes the classification adaptable: it allows agglomeration of outliers into clusters and tracking of population evolution, and it can be used without knowing the expected number of clusters and outliers. The proposed algorithm was tested against typical clustering algorithms, including two-phase algorithms suitable for data stream analysis. A number of typical test cases were selected to show the universality and qualities of the proposed clustering algorithm.
Pages: 139 - 184
Page count: 45
Related papers
50 in total
  • [41] An Effective Algorithm of Outlier Detection Based on Clustering
    Xia, Qingsong
    Xing, Changzheng
    Li, Na
    [J]. INTERNET OF THINGS-BK, 2012, 312 : 346 - 351
  • [42] Outlier Detection Algorithm Based on Iterative Clustering
    古平
    罗辛
    杨瑞龙
    张程
    [J]. Journal of Donghua University(English Edition), 2015, 32 (04) : 554 - 558
  • [43] Visual interactive evolutionary algorithm for high dimensional data clustering and outlier detection
    Boudjeloud, L
    Poulet, F
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2005, 3518 : 426 - 431
  • [44] Efficient Clustering-Based Outlier Detection Algorithm for Dynamic Data Stream
    Elahi, Manzoor
    Li, Kun
    Nisar, Wasif
    Lv, Xinjie
    Wang, Hongan
    [J]. FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 5, PROCEEDINGS, 2008, : 298 - 304
  • [45] Global High Dimension Outlier Algorithm for Efficient Clustering & Outlier Detection
    Nigam, Nidhi
    Saxena, Tripti
    Richhariya, Vineet
    [J]. 2016 SYMPOSIUM ON COLOSSAL DATA ANALYSIS AND NETWORKING (CDAN), 2016,
  • [46] Online Sparse Representation Clustering for Evolving Data Streams
    Chen, Jie
    Yang, Shengxiang
    Fahy, Conor
    Wang, Zhu
    Guo, Yinan
    Chen, Yingke
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, : 1 - 15
  • [47] Clustering Based Active Learning for Evolving Data Streams
    Ienco, Dino
    Bifet, Albert
    Zliobaite, Indre
    Pfahringer, Bernhard
    [J]. DISCOVERY SCIENCE, 2013, 8140 : 79 - 93
  • [48] Robust Clustering for Tracking Noisy Evolving Data Streams
    Nasraoui, Olfa
    Rojas, Carlos
    [J]. PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2006, : 619 - 623
  • [49] A statistical μ-partitioning method for clustering data streams
    Park, NH
    Lee, WS
    [J]. COMPUTER AND INFORMATION SCIENCES - ISCIS 2003, 2003, 2869 : 292 - 299
  • [50] Statistical σ-partition clustering over data streams
    Park, NH
    Lee, WS
    [J]. KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2003, PROCEEDINGS, 2003, 2838 : 387 - 398