Evaluation of Duplicate Detection Algorithms: From Quality Measures to Test Data Generation

Cited by: 4
Authors
Panse, Fabian [1]
Naumann, Felix [2]
Affiliations
[1] Univ Hamburg, Hamburg, Germany
[2] Univ Potsdam, Hasso Plattner Inst, Potsdam, Germany
Source
2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2021) | 2021
Keywords
RECORD LINKAGE
DOI
10.1109/ICDE51399.2021.00269
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Duplicate detection identifies multiple records in a dataset that represent the same real-world object. Many such approaches exist, both in research and in industry. To investigate essential properties of duplicate detection algorithms, such as their result quality or runtime behavior, they must be executed on suitable test data. The quality evaluation requires that these test data are labeled, constituting a ground truth. Correctly labeled, sizable, and real or at least realistic test datasets, however, are not easy to obtain, creating an obstacle for the advancement of research. In this tutorial, we present common methods to evaluate duplicate detection algorithms and to generate labeled test data. We close with a discussion of open problems.
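The abstract does not say which quality measures the tutorial covers. As an illustration only, the sketch below computes pairwise precision, recall, and F1, one widely used way to score a duplicate detection result against a labeled ground truth. The function names, the record identifiers, and the representation of results as lists of clusters are assumptions made for this example, not taken from the tutorial.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (smaller, larger) order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted_clusters, truth_clusters):
    """Pairwise precision, recall, and F1 of a clustering vs. ground truth."""
    predicted = duplicate_pairs(predicted_clusters)
    truth = duplicate_pairs(truth_clusters)
    true_positives = len(predicted & truth)
    # Convention assumed here: an empty pair set scores 1.0 (nothing to get wrong).
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth: r1, r2, r3 are one object; r5, r6 another.
truth = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]
# Hypothetical algorithm output: misses r3 and wrongly merges r3 with r4.
predicted = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]

precision, recall, f1 = pairwise_scores(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the ground truth contains four duplicate pairs and the prediction contains three, two of which are correct, so precision is 2/3 and recall is 1/2.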
Pages: 2373-2376
Number of pages: 4