Parallel Continuous Outlier Mining in Streaming Data

Cited by: 8
Authors
Toliopoulos, Theodoros [1]
Gounaris, Anastasios [1]
Tsichlas, Kostas [1]
Papadopoulos, Apostolos [1]
Sampaio, Sandra [2]
Affiliations
[1] Aristotle Univ Thessaloniki, Thessaloniki, Greece
[2] Univ Manchester, Manchester, Lancs, England
Keywords
anomaly detection; streams; Flink; distance-based outliers; algorithms
DOI
10.1109/DSAA.2018.00033
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this work, we focus on distance-based outliers in a metric space, where the status of an entity as to whether it is an outlier is based on the number of other entities in its neighborhood. In recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques targets a massively parallel setting. Our proposal fills this gap and studies how to transfer state-of-the-art techniques to Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied. We show speed-ups of up to 117 (resp. 2076) times over a naive parallel (resp. non-parallel) solution in Flink, using just an ordinary 4-core machine and a real-world dataset. Our results demonstrate that outlier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available as open source.
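As an illustration of the distance-based definition used in the abstract, the following is a minimal, non-parallel sketch in Python. It is not the authors' Flink implementation. It assumes the common formulation in which a point is an outlier if fewer than `k` other points in the current window lie within distance `radius`, together with a count-based sliding window. The names `window_outliers`, `continuous_outliers`, `radius`, `k`, and `window_size` are hypothetical and do not come from the record.

```python
from collections import deque
from math import dist


def window_outliers(window, radius, k):
    """Return the points in `window` with fewer than `k` neighbours within `radius`.

    Assumed definition (not taken from the record): a neighbour is any other
    point of the window at Euclidean distance <= radius.
    """
    points = list(window)
    outliers = []
    for i, p in enumerate(points):
        # Count every other point that falls inside the neighbourhood of p.
        neighbours = sum(
            1 for j, q in enumerate(points) if i != j and dist(p, q) <= radius
        )
        if neighbours < k:
            outliers.append(p)
    return outliers


def continuous_outliers(stream, window_size, radius, k):
    """Yield the current outlier list after each arrival.

    Uses a count-based sliding window: once `window_size` points are held,
    each new arrival evicts the oldest point.
    """
    window = deque(maxlen=window_size)
    for point in stream:
        window.append(point)
        yield window_outliers(window, radius, k)


if __name__ == "__main__":
    stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    for step, found in enumerate(continuous_outliers(stream, 4, 0.5, 1), start=1):
        print(step, found)
```

This sketch recomputes every neighbourhood from scratch on each arrival, which costs quadratic time per window slide. It is meant only to make the definition concrete, in the spirit of a naive baseline, and says nothing about how the paper's techniques avoid that cost.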
Pages: 227-236
Number of pages: 10
Related papers
50 in total
  • [1] Continuous outlier mining of streaming data in flink
    Toliopoulos, Theodoros
    Gounaris, Anastasios
    Tsichlas, Kostas
    Papadopoulos, Apostolos
    Sampaio, Sandra
    [J]. INFORMATION SYSTEMS, 2020, 93 (93)
  • [2] Parallel Frequent Itemset Mining on Streaming Data
    He, Yanshan
    Yue, Min
    [J]. 2014 10TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION (ICNC), 2014, : 725 - 730
  • [3] Designing a Streaming Algorithm for Outlier Detection in Data Mining-An Incremental Approach
    Yu, Kangqing
    Shi, Wei
    Santoro, Nicola
    [J]. SENSORS, 2020, 20 (05)
  • [4] Outlier Detection in Streaming Data A research Perspective
    Chugh, Neeraj
    Chugh, Mitali
    Agarwal, Alok
    [J]. 2014 INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING (PDGC), 2014, : 429 - 432
  • [5] Parallel Processing of Dynamic Continuous Queries over Streaming Data Flows
    Deng, Ze
    Wu, Xiaoming
    Wang, Lizhen
    Chen, Xiaodao
    Ranjan, Rajiv
    Zomaya, Albert
    Chen, Dan
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2015, 26 (03) : 834 - 846
  • [6] Feature grouping-based parallel outlier mining of categorical data using spark
    Li, Junli
    Zhang, Jifu
    Qin, Xiao
    Xun, Yaling
    [J]. INFORMATION SCIENCES, 2019, 504 : 1 - 19
  • [7] Outlier Detection Algorithms in Data Mining
    Xi, Jingke
    [J]. 2008 INTERNATIONAL SYMPOSIUM ON INTELLIGENT INFORMATION TECHNOLOGY APPLICATION, VOL I, PROCEEDINGS, 2008, : 94 - 97
  • [8] Parallel mining of contextual outlier using sparse subspace
    Zhao, Xujun
    Zhang, Jifu
    Qin, Xiao
    Cai, Jianghui
    Ma, Yang
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2019, 126 : 158 - 170
  • [9] Implementation of Infrastructure for Streaming Outlier Detection in Big Data
    Hasani, Zirije
    [J]. RECENT ADVANCES IN INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 2, 2017, 570 : 503 - 511
  • [10] Mining streaming emerging patterns from streaming data
    Alhammady, Hamad
    [J]. 2007 IEEE/ACS INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS, VOLS 1 AND 2, 2007, : 432 - 436