Distances between data sets based on summary statistics

Author
Tatti, Nikolaj [1]
Affiliation
[1] Aalto Univ, HIIT Basic Res Unit, Lab Comp & Informat Sci, Helsinki, Finland
Keywords
data mining theory; complex data; binary data; itemsets
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The concepts of similarity and distance are crucial in data mining. We consider the problem of defining the distance between two data sets by comparing summary statistics computed from them. The initial definition of our distance is based on geometrical notions of certain sets of distributions. We show that this distance can be computed in cubic time and has several intuitive properties, and that it is the unique Mahalanobis distance satisfying certain assumptions. For binary data sets, we further demonstrate that the distance can be represented naturally by certain parity functions and evaluated in linear time. Our empirical tests with real-world data show that the distance works well.
Pages: 131-154
Page count: 24
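
Illustrative sketch (not from the paper)
The abstract states that, for binary data, the distance can be expressed through parity functions and evaluated in linear time, but it gives no formula. The Python sketch below shows one possible reading, under an assumption the abstract does not state: the Mahalanobis covariance is taken under the uniform distribution over binary vectors. Under that assumption, parity functions of distinct non-empty itemsets each have variance 1/4 and are pairwise uncorrelated, so the Mahalanobis distance between the two vectors of parity frequencies reduces to twice their Euclidean distance. The names `parity_means` and `parity_distance` are hypothetical.

```python
from math import sqrt


def parity_means(rows, itemsets):
    """Return, for each itemset, the fraction of rows with odd parity on it.

    rows: list of equal-length 0/1 lists (a binary data set).
    itemsets: list of tuples of column indices (distinct, non-empty).
    """
    n = len(rows)
    return [
        sum(sum(row[i] for i in items) % 2 for row in rows) / n
        for items in itemsets
    ]


def parity_distance(d1, d2, itemsets):
    """Mahalanobis distance between parity-frequency vectors.

    Assumption (not from the abstract): the covariance is computed under
    the uniform distribution on {0,1}^K, where it equals I/4 for distinct
    non-empty itemsets, so the distance is 2 * Euclidean distance.
    """
    t1 = parity_means(d1, itemsets)
    t2 = parity_means(d2, itemsets)
    return 2.0 * sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))


# Two data sets with identical column frequencies but opposite
# correlation between the two columns.
d1 = [[0, 0], [1, 1]]
d2 = [[0, 1], [1, 0]]
itemsets = [(0,), (1,), (0, 1)]
print(parity_distance(d1, d2, itemsets))
```

For a fixed family of itemsets, each data set is scanned once, so the cost grows linearly with the number of rows. In the example, the single-column frequencies coincide and only the pairwise parity feature separates the two data sets.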