Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Citations: 20
Authors
Krleza, Dalibor [1 ]
Vrdoljak, Boris [1 ]
Brcic, Mario [1 ]
Affiliations
[1] Univ Zagreb, Fac Elect Engn & Comp, Unska 3, Zagreb, Croatia
Keywords
Big data; Clustering; Anomaly detection; Fraud detection
DOI
10.1007/s10994-020-05905-4
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse data streams is impossible due to processing power and memory issues. To solve this, the complexity of traditional clustering algorithms needed to be reduced, which led to the creation of sequential clustering algorithms. The usual approach is two-phase clustering, which uses an online phase to relax data details and complexity, and an offline phase to cluster the concepts created in the online phase. Detecting anomalies in a data stream is usually solved in the online phase, as it requires unreduced data. In contrast, good macro-clustering is produced in the offline phase, which is why two-phase clustering algorithms have difficulty being equally good at anomaly detection and macro-clustering. In this paper, we propose a statistical hierarchical clustering algorithm equally suitable for both detecting anomalies and macro-clustering. The proposed algorithm is single-phased and uses statistical inference on the input data stream, resulting in statistical distributions that are constantly updated. This makes the classification adaptable, allowing agglomeration of outliers into clusters and tracking of population evolution, and allowing the algorithm to be used without knowing the expected number of clusters and outliers. The proposed algorithm was tested against typical clustering algorithms, including two-phase algorithms suitable for data stream analysis. A number of typical test cases were selected to show the universality and qualities of the proposed clustering algorithm.
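The idea of a constantly updated statistical distribution used for single-pass outlier detection can be illustrated with a much simpler stand-in. The sketch below is not the algorithm proposed in the paper: it assumes a single one-dimensional Gaussian population, updates its mean and variance incrementally with Welford's method, and flags a point when its z-score exceeds a threshold. The names RunningGaussian and scan, the threshold of 3.0 and the min_points warm-up are all hypothetical choices made for illustration.

```python
import math


class RunningGaussian:
    """Incrementally updated 1-D Gaussian summary (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new observation into the running mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; reported as 0.0 with fewer than two points.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_outlier(self, x, threshold=3.0, min_points=5):
        # Assumed rule: a point is an outlier if it lies more than
        # `threshold` standard deviations from the current mean.
        if self.n < min_points:
            return False  # too little data to judge
        s = self.std()
        if s == 0.0:
            return x != self.mean
        return abs(x - self.mean) / s > threshold


def scan(stream, threshold=3.0):
    """Single pass over the stream; returns the values flagged as outliers."""
    model = RunningGaussian()
    outliers = []
    for x in stream:
        if model.is_outlier(x, threshold):
            outliers.append(x)
        else:
            model.update(x)  # only inliers update the distribution
    return outliers


print(scan([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0, 10.1]))
```

Unlike the approach described in the abstract, this sketch keeps one population, never agglomerates outliers into new clusters, and builds no hierarchy; it only shows how a summary that is updated per point lets detection happen without storing the stream.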
Pages: 139-184
Page count: 46