Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions

Cited by: 7
Authors
Karlas, Bojan [1]
Li, Peng [2]
Wu, Renzhi [2]
Gurel, Nezihe Merve [1]
Chu, Xu [2]
Wu, Wentao [3]
Zhang, Ce [1]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
[3] Microsoft Res, Redmond, WA USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2020 / Vol. 14 / Issue 03
Funding
European Union Horizon 2020; Swiss National Science Foundation;
Keywords
MULTIPLE IMPUTATION; MISSING DATA;
DOI
10.14778/3430915.3430917
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) - a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed - we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% of the gap on average by cleaning 36% of the dirty data on average, while the best automatic cleaning approach, BoostClean, closes only 14% of the gap on average.
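The two CP queries defined in the abstract can be illustrated with a naive brute-force sketch for a 1-nearest-neighbor classifier. This is not the paper's linear or polynomial-time algorithm: it enumerates every possible world explicitly, so it is exponential and only meant to make the definitions concrete. The function names (`nearest_label`, `count_query`, `check_query`), the representation of incompleteness as a list of candidate feature vectors per training example, the squared Euclidean distance, and the lowest-index tie-breaking are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def nearest_label(world, labels, test_point):
    """Label of the training point closest to test_point (squared Euclidean)."""
    def sq_dist(x):
        return sum((a - b) ** 2 for a, b in zip(x, test_point))
    # Ties are broken by the lowest index, an arbitrary choice for this sketch.
    best = min(range(len(world)), key=lambda i: sq_dist(world[i]))
    return labels[best]


def count_query(candidates, labels, test_point):
    """Q2 (naive): number of possible worlds supporting each predicted label."""
    counts = Counter()
    # A possible world picks exactly one candidate for every training example.
    for world in product(*candidates):
        counts[nearest_label(world, labels, test_point)] += 1
    return counts


def check_query(candidates, labels, test_point):
    """Q1 (naive): True if every possible world yields the same prediction."""
    return len(count_query(candidates, labels, test_point)) == 1


# Toy data (hypothetical): the second example is incomplete and has two
# candidate repairs; the other two examples are complete.
candidates = [
    [(0.0, 0.0)],
    [(5.0, 5.0), (0.5, 0.5)],
    [(6.0, 6.0)],
]
labels = ["a", "b", "b"]

near_b = count_query(candidates, labels, (5.6, 5.6))
near_a = count_query(candidates, labels, (0.4, 0.4))
```

In this toy setup the test point (5.6, 5.6) receives label "b" in both possible worlds, so it can be CP'ed, while (0.4, 0.4) receives "a" in one world and "b" in the other, so it cannot.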
Pages: 255 - 267
Number of pages: 13