Schema-agnostic vs Schema-based Configurations for Blocking Methods on Homogeneous Data

Cited by: 0
Authors
Papadakis, George [1, 3]
Alexiou, George [2]
Papastefanatos, George [2]
Koutrika, Georgia
Affiliations
[1] HP Labs, Palo Alto, CA USA
[2] Res Ctr Athena, IMIS, Athens, Greece
[3] Univ Athens, Dept Informat & Telecommun, GR-10679 Athens, Greece
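Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2015 / Vol. 9 / No. 04
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract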
Entity Resolution constitutes a core task for data integration that, due to its quadratic complexity, typically scales to large datasets through blocking methods. These can be configured in two ways. The schema-based configuration relies on schema information in order to select signatures of high distinctiveness and low noise, while the schema-agnostic one treats every token from all attribute values as a signature. The latter approach has significant potential, as it requires no fine-tuning by human experts and it applies to heterogeneous data. Yet, there is no systematic study on its relative performance with respect to the schema-based configuration. This work covers this gap by comparing analytically the two configurations in terms of effectiveness, time efficiency and scalability. We apply them to 9 established blocking methods and to 11 benchmarks of structured data. We provide valuable insights into the internal functionality of the blocking methods with the help of a novel taxonomy. Our studies reveal that the schema-agnostic configuration offers unsupervised and robust definition of blocking keys under versatile settings, trading a higher computational cost for a consistently higher recall than the schema-based one. It also enables the use of state-of-the-art blocking methods without schema knowledge.
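The contrast between the two configurations described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the toy records, and the choice of the whole `name` value as the schema-based key are assumptions made purely for illustration. The schema-agnostic function treats every token of every attribute value as a signature, as the abstract states, while the schema-based function derives one key from a single hand-picked attribute.

```python
import re
from collections import defaultdict


def schema_agnostic_blocks(records):
    """Token blocking: every token of every attribute value is a signature."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for value in rec.values():
            # Lowercase and split on non-word characters (an assumed tokenizer).
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(rid)
    # Only blocks holding at least two records produce comparisons.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def schema_based_blocks(records, attribute):
    """Simplified schema-based blocking: the full value of one chosen
    attribute is the blocking key (an assumption for this sketch)."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        value = rec.get(attribute)
        if value:
            blocks[str(value).lower()].add(rid)
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


# Hypothetical toy data: records 1 and 2 describe the same person.
records = {
    1: {"name": "John Smith", "city": "Athens"},
    2: {"name": "Smith, John", "city": "Athens"},
    3: {"name": "Mary Jones", "city": "Palo Alto"},
}

agnostic = schema_agnostic_blocks(records)
based = schema_based_blocks(records, "name")
```

On this toy input, the schema-agnostic variant places records 1 and 2 together in three blocks (`john`, `smith`, `athens`), whereas the simplified schema-based key misses the match because the two name strings differ. This mirrors, in miniature, the trade-off the abstract reports: more candidate comparisons in exchange for higher recall.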
Pages: 312 - 323
Page count: 12
Related papers
50 items in total
  • [41] Efficient schema-based XML-to-relational data mapping
    Atay, Mustafa
    Chebotko, Artem
    Liu, Dapeng
    Lu, Shiyong
    Fotouhi, Farshad
    INFORMATION SYSTEMS, 2007, 32 (03) : 458 - 476
  • [42] THE SCHEMA-BASED APPROACH TO WORKFLOW MANAGEMENT
    BROCKMAN, JB
    DIRECTOR, SW
    IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 1995, 14 (10) : 1257 - 1267
  • [43] Schema-based Constrained XML data indexing and storage Technique
    Chen, Xuebin
    Duan, Guolin
    Yan, Hongcan
    Zhang, Shufen
    Che, Yuee
    2009 INTERNATIONAL CONFERENCE ON NEW TRENDS IN INFORMATION AND SERVICE SCIENCE (NISS 2009), VOLS 1 AND 2, 2009, : 973 - +
  • [44] Three-element schema technique in schema-based user interface design
    Horii, K
    Tsuchiya, K
    INTERNATIONAL JOURNAL OF INDUSTRIAL ERGONOMICS, 1996, 18 (2-3) : 127 - 133
  • [45] Are Schema-Based and Modified Schema-Based Instruction Evidence-Based Practices for Students with Disabilities? A Meta-Analysis
    Yucesoy-Ozkan, Serife
    Cakmak, Zulal
    Cevher, Zehra
    Gulboy, Emrah
    Oz-Alkoyak, Husne
    EDUCATION AND TRAINING IN AUTISM AND DEVELOPMENTAL DISABILITIES, 2022, 57 (04) : 446 - 461
  • [46] LiteIndex: Memory-Efficient Schema-Agnostic Indexing for JSON documents in SQLite
    Shang, Siqi
    Wu, Qihong
    Wang, Tianyu
    Shao, Zili
    2021 26TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC), 2021, : 435 - 440
  • [47] The Schema-Agnostic Queries (SAQ-2015) Semantic Web Challenge: Task Description
    Freitas, Andre
    Unger, Christina
    SEMANTIC WEB EVALUATION CHALLENGES, 2015, 548 : 191 - 198
  • [48] Schema-based memory processes and eyewitness recollection
    Mallard, D
    Greig, J
    AUSTRALIAN JOURNAL OF PSYCHOLOGY, 2005, 57 : 227 - 227
  • [49] Pattern set mining with schema-based constraint
    Cagliero, Luca
    Chiusano, Silvia
    Garza, Paolo
    Bruno, Giulia
    KNOWLEDGE-BASED SYSTEMS, 2015, 84 : 224 - 238
  • [50] A schema-based approach to specifying conversation policies
    Lin, FH
    Norrie, DH
    Shen, WM
    Kremer, R
    ISSUES IN AGENT COMMUNICATION, 2000, 1916 : 193 - 204