Relationship-based clustering and visualization for high-dimensional data mining

Cited by: 59
Authors
Strehl, A [1]
Ghosh, J [1]
Affiliation
[1] Univ Texas, Dept Elect & Comp Engn, Austin, TX 78712 USA
Keywords
cluster analysis; graph partitioning; high dimensional; visualization; retail customers; text mining; web-log analysis;
DOI
10.1287/ijoc.15.2.208.14448
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
In several real-life data-mining applications, data reside in very high (1000 or more) dimensional space, where both clustering techniques developed for low-dimensional spaces (k-means, BIRCH, CLARANS, CURE, DBScan, etc.) as well as visualization methods such as parallel coordinates or projective visualizations, are rendered ineffective. This paper proposes a relationship-based approach that alleviates both problems, side-stepping the "curse-of-dimensionality" issue by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to re-order the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is by itself not novel, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high-dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance, and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user toward the right number of clusters. Results are presented on a real retail industry dataset of several thousand customers and products, as well as on clustering of web-document collections and of web-log sessions.
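The re-ordering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the cosine similarity measure, the toy data, the hard-coded `labels` array (standing in for the output of a graph-partitioning step), and the helper names `cosine_similarity_matrix` and `permute_by_cluster` are all assumptions made for illustration. It only shows how permuting a similarity matrix by cluster membership makes clusters appear as contiguous high-similarity regions.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (assumed measure)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T

def permute_by_cluster(S, labels):
    """Reorder rows and columns of S so same-cluster points are contiguous."""
    order = np.argsort(labels, kind="stable")
    return S[np.ix_(order, order)], order

# Toy data: 6 points in 1000 dimensions; the two groups use disjoint attributes.
rng = np.random.default_rng(0)
labels = np.array([0, 1, 0, 1, 0, 1])  # hypothetical clustering output
X = np.zeros((6, 1000))
X[labels == 0, :500] = rng.random((3, 500))
X[labels == 1, 500:] = rng.random((3, 500))

S = cosine_similarity_matrix(X)          # 6 x 6, independent of the 1000 dims
S_perm, order = permute_by_cluster(S, labels)

# After permutation, within-cluster similarities sit in diagonal blocks.
print(order)
print(np.round(S_perm, 2))
```

Once `S` is formed, all further processing works on an n-by-n matrix regardless of the original dimensionality, which corresponds to property (i) in the abstract. Rendering `S_perm` as a grayscale image would be the natural next step for visual inspection.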
Pages: 208-230
Page count: 23