Distances between data sets based on summary statistics

Author
Tatti, Nikolaj [1 ]
Affiliation
[1] Aalto Univ, HIIT Basic Res Unit, Lab Comp & Informat Sci, Helsinki, Finland
Keywords
data mining theory; complex data; binary data; itemsets;
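DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code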
0812 ;
Abstract
The concepts of similarity and distance are crucial in data mining. We consider the problem of defining the distance between two data sets by comparing summary statistics computed from the data sets. The initial definition of our distance is based on geometrical notions of certain sets of distributions. We show that this distance can be computed in cubic time and that it has several intuitive properties. We also show that it is the unique Mahalanobis distance satisfying certain assumptions. Furthermore, for binary data sets the distance can be represented naturally by certain parity functions and evaluated in linear time. Our empirical tests with real-world data show that the distance works well.
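The binary case mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: it assumes that the summary statistics are the average values of parity functions over chosen itemsets, and that the Mahalanobis covariance is the one those parity functions have under the uniform distribution on binary vectors. Under that assumption, parity functions of distinct non-empty itemsets are uncorrelated with variance 1/4, so the Mahalanobis distance reduces to twice the Euclidean distance between the two frequency vectors. The names `parity_frequencies`, `parity_distance`, and `itemsets_up_to` are hypothetical helpers introduced only for this illustration.

```python
from itertools import combinations
from math import sqrt


def itemsets_up_to(num_columns, max_size):
    """All non-empty column subsets with at most max_size columns (hypothetical helper)."""
    return [
        subset
        for size in range(1, max_size + 1)
        for subset in combinations(range(num_columns), size)
    ]


def parity_frequencies(rows, itemsets):
    """Average of each parity function over the rows of a binary data set.

    The parity function of an itemset is 1 when an odd number of the
    itemset's columns are 1 in the row, and 0 otherwise.
    """
    n = len(rows)
    return [
        sum(sum(row[i] for i in items) % 2 for row in rows) / n
        for items in itemsets
    ]


def parity_distance(rows_a, rows_b, itemsets):
    """Assumed form: Mahalanobis distance with covariance (1/4) * identity,
    which equals 2 * Euclidean distance between the parity frequency vectors."""
    freq_a = parity_frequencies(rows_a, itemsets)
    freq_b = parity_frequencies(rows_b, itemsets)
    return 2.0 * sqrt(sum((a - b) ** 2 for a, b in zip(freq_a, freq_b)))


if __name__ == "__main__":
    # Two toy data sets with identical column margins but opposite correlation.
    data_a = [(0, 0), (1, 1)]
    data_b = [(0, 1), (1, 0)]
    features = itemsets_up_to(num_columns=2, max_size=2)
    print(parity_distance(data_a, data_b, features))
```

Each data set is scanned once per itemset, so for a fixed feature family the cost grows linearly with the number of rows. The toy data sets differ only in the pairwise parity feature, which is why single-column frequencies alone would not separate them.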
Pages: 131-154
Number of pages: 24