Distances between data sets based on summary statistics

Cited by: 0
Author
Tatti, Nikolaj [1]
Affiliation
[1] Aalto Univ, HIIT Basic Res Unit, Lab Comp & Informat Sci, Helsinki, Finland
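Keywords
data mining theory; complex data; binary data; itemsets
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract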
The concepts of similarity and distance are crucial in data mining. We consider the problem of defining the distance between two data sets by comparing summary statistics computed from the data sets. The initial definition of our distance is based on geometrical notions of certain sets of distributions. We show that this distance can be computed in cubic time and that it has several intuitive properties. We also show that this distance is the unique Mahalanobis distance satisfying certain assumptions. Furthermore, we demonstrate that for binary data sets the distance can be represented naturally by certain parity functions and evaluated in linear time. Our empirical tests with real-world data show that the distance works well.
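The binary-data case mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper; it assumes the summary statistics are parity features over a set of distinct non-empty itemsets and that the Mahalanobis covariance is the one these features have under the uniform distribution on binary rows. Under that assumption the features are pairwise uncorrelated with variance 1/4, so the Mahalanobis distance reduces to twice the Euclidean distance between the parity frequency vectors. The function names `parity_frequencies` and `parity_distance` are hypothetical.

```python
from math import sqrt


def parity_frequencies(rows, itemsets):
    """Fraction of rows in which each itemset has an odd number of 1s."""
    n = len(rows)
    freqs = []
    for itemset in itemsets:
        odd = sum(sum(row[i] for i in itemset) % 2 for row in rows)
        freqs.append(odd / n)
    return freqs


def parity_distance(rows_a, rows_b, itemsets):
    """Mahalanobis-style distance between two binary data sets.

    Assumption (not from the source): the covariance is that of the parity
    features under the uniform distribution, i.e. (1/4) * identity for
    distinct non-empty itemsets, so the distance is 2 * Euclidean distance.
    """
    fa = parity_frequencies(rows_a, itemsets)
    fb = parity_frequencies(rows_b, itemsets)
    return 2.0 * sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


# Two toy data sets with identical item margins but opposite co-occurrence.
data_a = [(0, 0), (1, 1)]
data_b = [(1, 0), (0, 1)]
itemsets = [(0,), (1,), (0, 1)]

d_same = parity_distance(data_a, data_a, itemsets)
d_diff = parity_distance(data_a, data_b, itemsets)
```

Each frequency vector is computed in a single pass over the rows, so for a fixed collection of itemsets the cost grows linearly with the number of rows. In the toy example the single-item frequencies agree and only the pairwise parity feature separates the two data sets.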
Pages: 131-154
Number of pages: 24
Related papers
50 in total
  • [1] Simple Data Sets for Distinct Basic Summary Statistics
    Lesser, Lawrence
    TEACHING STATISTICS, 2011, 33 (01) : 9 - 11
  • [2] Distances between sets based on set commonality
    Horadam, K. J.
    Nyblom, M. A.
    DISCRETE APPLIED MATHEMATICS, 2014, 167 : 310 - 314
  • [3] Big Data Clustering based on Summary Statistics
    Fu, Junsong
    Liu, Yun
    Zhang, Zhenjiang
    Xiong, Fei
    2015 FIRST INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE THEORY, SYSTEMS AND APPLICATIONS (CCITSA 2015), 2015, : 87 - 91
  • [4] Energy statistics: A class of statistics based on distances
    Szekely, Gabor J.
    Rizzo, Maria L.
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2013, 143 (08) : 1249 - 1272
  • [5] Distances between intuitionistic fuzzy sets
    Szmidt, E
    Kacprzyk, J
    FUZZY SETS AND SYSTEMS, 2000, 114 (03) : 505 - 518
  • [6] DISTANCES BETWEEN FUZZY-SETS
    ROSENFELD, A
    PATTERN RECOGNITION LETTERS, 1985, 3 (04) : 229 - 233
  • [7] The statistics of small data sets
    Ball, RO
    Hahn, MW
    SUPERFUND RISK ASSESSMENT IN SOIL CONTAMINATION STUDIES: THIRD VOLUME, 1998, 1338 : 23 - 36
  • [8] Informative and adaptive distances and summary statistics in sequential approximate Bayesian computation
    Schaelte, Yannik
    Hasenauer, Jan
    PLOS ONE, 2023, 18 (05):
  • [9] Pattern recognition based on new distances between intuitionistic fuzzy sets
    Feng, Yu
    Chen, Dongfeng
    Liu, Hui
    MECHATRONICS AND INTELLIGENT MATERIALS II, PTS 1-6, 2012, 490-495 : 412 - 416
  • [10] On the distribution of summary statistics for missing data
    Ringham, B. M.
    Kreidler, S. M.
    Muller, K. E.
    Glueck, D. H.
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2019, 48 (05) : 1149 - 1165