Fast-RCM: Fast Tree-Based Unsupervised Rare-Class Mining

Cited by: 0
Authors
Weng, Haiqin [1]
Ji, Shouling [1, 2, 3]
Liu, Changchang [4]
Wang, Ting [5]
He, Qinming [1]
Chen, Jianhai [1]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310027, Peoples R China
[2] Zhejiang Univ, Inst Cyberspace Res, Hangzhou 310027, Peoples R China
[3] Zhejiang Univ, Alibaba Zhejiang Univ Joint Inst Frontier Technol, Hangzhou 310027, Peoples R China
[4] IBM Thomas J Watson Res Ctr, Dept Distributed AI, Yorktown Hts, NY 10598 USA
[5] Lehigh Univ, Dept Comp Sci, Bethlehem, PA 18015 USA
Keywords
Anomaly detection; Diseases; Vegetation; Approximation algorithms; Time complexity; Computer science; Clustering methods; data mining; tree data structures; CATEGORY DETECTION
DOI
10.1109/TCYB.2019.2924804
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Rare classes are usually hidden in an imbalanced dataset in which the majority of the data examples come from major classes. Rare-class mining (RCM) aims at extracting all the data examples belonging to rare classes. Most of the existing approaches for RCM require a certain amount of labeled data examples as input. However, they are ineffective in practice since requesting label information from domain experts is time consuming and labor intensive. Thus, we investigate the unsupervised RCM problem, which to the best of our knowledge is the first such attempt. To this end, we propose an efficient algorithm called Fast-RCM for unsupervised RCM, which has an approximately linear time complexity with respect to data size and data dimensionality. Given an unlabeled dataset, Fast-RCM mines the rare classes by first building a rare tree for the input dataset and then extracting data examples of the rare classes based on this rare tree. Compared with the existing approaches, which have quadratic or even cubic time complexity, Fast-RCM is much faster and can be extended to large-scale datasets. The experimental evaluation on both synthetic and real-world datasets demonstrates that our algorithm can effectively and efficiently extract the rare classes from an unlabeled dataset under the unsupervised setting, and is approximately five times faster than the state-of-the-art methods.
Pages: 5198-5211
Page count: 14
Related papers
50 in total
  • [31] Initialization of dynamic time warping using tree-based fast Nearest Neighbor
    Poularakis, Stergios
    Katsavounidis, Ioannis
    PATTERN RECOGNITION LETTERS, 2016, 79 : 31 - 37
  • [32] A tabular pruning rule in tree-based fast nearest neighbor search algorithms
    Oncina, Jose
    Thollard, Franck
    Gomez-Ballester, Eva
    Mico, Luisa
    Moreno-Seco, Francisco
    PATTERN RECOGNITION AND IMAGE ANALYSIS, PT 2, PROCEEDINGS, 2007, 4478 : 306 - +
  • [33] Fast tree-based algorithms for DBSCAN for low-dimensional data on GPUs
    Prokopenko, Andrey
    Lebrun-Grandie, Damien
    Arndt, Daniel
    PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023, : 503 - 512
  • [34] Ethernet ultra-fast switching: a tree-based local recovery scheme
    Jin, D.
    Li, Y.
    Chen, W.
    Su, L.
    Zeng, L.
    IET COMMUNICATIONS, 2010, 4 (04) : 410 - 418
  • [35] A Fast Algorithm for Mining Rare Itemsets
    Troiano, Luigi
    Scibelli, Giacomo
    Birtolo, Cosimo
    2009 9TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS DESIGN AND APPLICATIONS, 2009, : 1149 - +
  • [36] TeKET: a Tree-Based Unsupervised Keyphrase Extraction Technique
    Gollam Rabby
    Saiful Azad
    Mufti Mahmud
    Kamal Z. Zamli
    Mohammed Mostafizur Rahman
    Cognitive Computation, 2020, 12 : 811 - 833
  • [37] Unsupervised discretization using tree-based density estimation
    Schmidberger, G
    Frank, E
    KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 2005, 3721 : 240 - 251
  • [38] TeKET: a Tree-Based Unsupervised Keyphrase Extraction Technique
    Rabby, Gollam
    Azad, Saiful
    Mahmud, Mufti
    Zamli, Kamal Z.
    Rahman, Mohammed Mostafizur
    COGNITIVE COMPUTATION, 2020, 12 (04) : 811 - 833
  • [39] Data Mining with a Tree-Based Scan Statistic
    Brown, Jeffrey S.
    Dashevsky, Inna
    Fireman, Bruce
    Herrinton, Lisa
    McClure, David
    Murphy, Michael
    Raebel, Marsha
    Sturtevant, Jessica
    Kulldorff, Martin
    PHARMACOEPIDEMIOLOGY AND DRUG SAFETY, 2011, 20 : S331 - S331
  • [40] Differentially private tree-based redescription mining
    Matej Mihelčić
    Pauli Miettinen
    Data Mining and Knowledge Discovery, 2023, 37 : 1548 - 1590