YADING: Fast Clustering of Large-Scale Time Series Data

Cited by: 58

Authors:

Ding, Rui [1]

Wang, Qiang [1]

Dang, Yingnong [1]

Fu, Qiang [1]

Zhang, Haidong [1]

Zhang, Dongmei [1]

Affiliations:

[1] Microsoft Research, Beijing, People's Republic of China

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2015 / Vol. 8 / No. 05

Keywords:

DOI:

10.14778/2735479.2735481

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Fast and scalable analysis techniques are becoming increasingly important in the era of big data, because they are the enabling techniques for real-time and interactive experiences in data analysis. Time series are widely available in diverse application areas. Due to the large number of time series instances (e.g., millions) and the high dimensionality of each time series instance (e.g., thousands), it is challenging to conduct clustering on large-scale time series, and it is even more challenging to do so in real time to support interactive exploration. In this paper, we propose a novel end-to-end time series clustering algorithm, YADING, which automatically clusters large-scale time series with fast performance and quality results. Specifically, YADING consists of three steps: sampling the input dataset, conducting clustering on the sampled dataset, and assigning the rest of the input data to the clusters generated on the sampled dataset. In particular, we provide a theoretical proof of the lower and upper bounds of the sample size, which not only guarantees YADING's high performance, but also ensures the distribution consistency between the input dataset and the sampled dataset. We also select the L1 norm as the similarity measure and the multi-density approach as the clustering method. With a theoretical bound, this selection ensures YADING's robustness to time series variations due to phase perturbation and random noise. Evaluation results have demonstrated that on typical-scale datasets (100,000 time series, each with 1,000 dimensions), YADING is about 40 times faster than the state-of-the-art sampling-based clustering algorithm DENCLUE 2.0, and about 1,000 times faster than DBSCAN and CLARANS. YADING has also been used by product teams at Microsoft to analyze service performance. Two such use cases are shared in this paper.
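The three-step structure described in the abstract (sample, cluster the sample, assign the remainder) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `radius` parameter, and the greedy leader clustering used for step 2 are assumptions made for illustration, standing in for the multi-density method the paper uses. The sample size is passed in as a plain parameter rather than derived from the paper's theoretical bounds. Only the use of the L1 norm as the distance and the overall three-step flow come from the abstract.

```python
import random


def l1_distance(a, b):
    """L1 (Manhattan) distance between two equal-length series."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_sample(sample, radius):
    """Placeholder for step 2: greedy leader clustering with a fixed radius.

    Assumption: this is a simple stand-in, NOT the multi-density
    clustering described in the paper.
    """
    leaders = []  # one representative series per cluster
    labels = []
    for series in sample:
        for cluster_id, leader in enumerate(leaders):
            if l1_distance(series, leader) <= radius:
                labels.append(cluster_id)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(series)
            labels.append(len(leaders) - 1)
    return labels


def assign(series, sample, sample_labels):
    """Step 3: give a series the label of its nearest sampled series (L1)."""
    nearest = min(range(len(sample)),
                  key=lambda i: l1_distance(series, sample[i]))
    return sample_labels[nearest]


def sample_cluster_assign(dataset, sample_size, radius, seed=0):
    """Run the three steps and return one cluster label per input series."""
    rng = random.Random(seed)
    # Step 1: draw a random sample of indices without replacement.
    sample_idx = rng.sample(range(len(dataset)), sample_size)
    sample = [dataset[i] for i in sample_idx]
    # Step 2: cluster only the sampled series.
    sample_labels = cluster_sample(sample, radius)
    labels = [None] * len(dataset)
    for i, label in zip(sample_idx, sample_labels):
        labels[i] = label
    # Step 3: assign every unsampled series to an existing cluster.
    for i, series in enumerate(dataset):
        if labels[i] is None:
            labels[i] = assign(series, sample, sample_labels)
    return labels
```

In this sketch the expensive clustering step only ever sees `sample_size` series, and each remaining series costs one pass over the sample, which is the source of the speedup that sampling-based approaches aim for.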


Pages: 473-484

Page count: 12

50 records in total

[1] A Fast Semi-Supervised Clustering Framework for Large-Scale Time Series Data
He, Guoliang
Pan, Yanzhou
Xia, Xuewen
He, Jinrong
Peng, Rong
Xiong, Neal N.
[J]. IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2021, 51 (07) : 4201 - 4216
[2] LARGE-SCALE TIME SERIES CLUSTERING WITH k-ARs
Yue, Zuogong
Solo, Victor
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020 : 6044 - 6048
[3] Fast Large-Scale Trajectory Clustering
Wang, Sheng
Bao, Zhifeng
Culpepper, J. Shane
Sellis, Timos
Qin, Xiaolin
[J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 13 (01) : 29 - 42
[5] Fast and scalable support vector clustering for large-scale data analysis
Ping, Yuan
Chang, Yun Feng
Zhou, Yajian
Tian, Ying Jie
Yang, Yi Xian
Zhang, Zhili
[J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2015, 43 (02) : 281 - 310
[6] KNN-BLOCK DBSCAN: Fast Clustering for Large-Scale Data
Chen, Yewang
Zhou, Lida
Pei, Songwen
Yu, Zhiwen
Chen, Yi
Liu, Xin
Du, Jixiang
Xiong, Naixue
[J]. IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2021, 51 (06) : 3939 - 3953
[7] Large-Scale Time Series Clustering Based on Fuzzy Granulation and Collaboration
Wang, Xiao
Yu, Fusheng
Zhang, Huixin
Liu, Shihu
Wang, Jiayin
[J]. INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2015, 30 (06) : 763 - 780
[8] Granulation-based Fuzzy Clustering of Large-scale Time Series
Wang, Xiao
Yu, Fusheng
Zhang, Huixin
[J]. 2013 10TH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (FSKD), 2013 : 466 - 471
[9] Large-scale parallel data clustering
Judd, D
McKinley, PK
Jain, AK
[J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1998, 20 (08) : 871 - 876
[10] HGC: fast hierarchical clustering for large-scale single-cell data
Zou, Ziheng
Hua, Kui
Zhang, Xuegong
[J]. BIOINFORMATICS, 2021, 37 (21) : 3964 - 3965
