YADING: Fast Clustering of Large-Scale Time Series Data

Cited by: 58
Authors
Ding, Rui [1]
Wang, Qiang [1]
Dang, Yingnong [1]
Fu, Qiang [1]
Zhang, Haidong [1]
Zhang, Dongmei [1]
Affiliations
[1] Microsoft Research, Beijing, People's Republic of China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2015 / Volume 8 / Issue 05
DOI
10.14778/2735479.2735481
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Fast and scalable analysis techniques are becoming increasingly important in the era of big data, because they are the enabling techniques to create real-time and interactive experiences in data analysis. Time series are widely available in diverse application areas. Due to the large number of time series instances (e.g., millions) and the high dimensionality of each time series instance (e.g., thousands), it is challenging to conduct clustering on large-scale time series, and it is even more challenging to do so in real time to support interactive exploration. In this paper, we propose a novel end-to-end time series clustering algorithm, YADING, which automatically clusters large-scale time series with fast performance and quality results. Specifically, YADING consists of three steps: sampling the input dataset, conducting clustering on the sampled dataset, and assigning the rest of the input data to the clusters generated on the sampled dataset. In particular, we provide a theoretical proof of the lower and upper bounds of the sample size, which not only guarantee YADING's high performance, but also ensure the distribution consistency between the input dataset and the sampled dataset. We also select the L1 norm as the similarity measure and the multi-density approach as the clustering method. With a theoretical bound, this selection ensures YADING's robustness to time series variations due to phase perturbation and random noise. Evaluation results have demonstrated that on typical-scale (100,000 time series each with 1,000 dimensions) datasets, YADING is about 40 times faster than the state-of-the-art, sampling-based clustering algorithm DENCLUE 2.0, and about 1,000 times faster than DBSCAN and CLARANS. YADING has also been used by product teams at Microsoft to analyze service performance. Two such use cases are shared in this paper.
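The three steps named in the abstract (sample, cluster the sample, assign the remainder) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `l1_distance`, `cluster_sample`, and `sample_cluster_assign` are hypothetical. The fixed-`radius` connectivity grouping is only a simple stand-in for the paper's multi-density method. The `sample_size` is chosen by hand rather than from the paper's theoretical bounds.

```python
import random


def l1_distance(a, b):
    """L1 (Manhattan) distance between two equal-length series."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_sample(sample, radius):
    """Group sampled series into connected components, linking any two
    series whose L1 distance is at most `radius`.
    Assumption: a stand-in for the paper's multi-density clustering."""
    labels = [-1] * len(sample)
    next_label = 0
    for i in range(len(sample)):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        stack = [i]
        while stack:
            j = stack.pop()
            for k in range(len(sample)):
                if labels[k] == -1 and l1_distance(sample[j], sample[k]) <= radius:
                    labels[k] = next_label
                    stack.append(k)
        next_label += 1
    return labels


def sample_cluster_assign(series, sample_size, radius, seed=0):
    """Step 1: draw a random sample. Step 2: cluster the sample.
    Step 3: give every remaining series the label of its nearest
    sampled series under the L1 distance."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(series)), sample_size)
    sample = [series[i] for i in sample_idx]
    sample_labels = cluster_sample(sample, radius)

    labels = [None] * len(series)
    for i, label in zip(sample_idx, sample_labels):
        labels[i] = label
    for i, s in enumerate(series):
        if labels[i] is None:
            nearest = min(range(len(sample)),
                          key=lambda j: l1_distance(s, sample[j]))
            labels[i] = sample_labels[nearest]
    return labels
```

In this sketch only the sampled series are compared pairwise. Each remaining series needs one pass over the sample, which is the general reason a sampling-based design can scale to large inputs.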
Pages: 473-484
Page count: 12