Schema-agnostic vs Schema-based Configurations for Blocking Methods on Homogeneous Data

Authors
Papadakis, George [1,3]
Alexiou, George [2]
Papastefanatos, George [2]
Koutrika, Georgia
Affiliations
[1] HP Labs, Palo Alto, CA USA
[2] Res Ctr Athena, IMIS, Athens, Greece
[3] Univ Athens, Dept Informat & Telecommun, GR-10679 Athens, Greece
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2015 / Vol. 9 / No. 04
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Entity Resolution constitutes a core task for data integration that, due to its quadratic complexity, typically scales to large datasets through blocking methods. These can be configured in two ways. The schema-based configuration relies on schema information in order to select signatures of high distinctiveness and low noise, while the schema-agnostic one treats every token from all attribute values as a signature. The latter approach has significant potential, as it requires no fine-tuning by human experts and it applies to heterogeneous data. Yet, there is no systematic study on its relative performance with respect to the schema-based configuration. This work covers this gap by comparing analytically the two configurations in terms of effectiveness, time efficiency and scalability. We apply them to 9 established blocking methods and to 11 benchmarks of structured data. We provide valuable insights into the internal functionality of the blocking methods with the help of a novel taxonomy. Our studies reveal that the schema-agnostic configuration offers unsupervised and robust definition of blocking keys under versatile settings, trading a higher computational cost for a consistently higher recall than the schema-based one. It also enables the use of state-of-the-art blocking methods without schema knowledge.
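The contrast between the two configurations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy records, the attribute names, and the helper functions (schema_based_blocks, schema_agnostic_blocks, candidate_pairs) are all hypothetical, and the schema-based variant simply assumes that the whole value of one hand-picked attribute serves as the blocking key.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; attribute names and values are made up for illustration only.
records = {
    1: {"name": "John Smith", "city": "Athens"},
    2: {"name": "Jon Smith", "city": "Athens"},
    3: {"name": "Mary Jones", "city": "Palo Alto"},
}


def schema_based_blocks(records, key_attribute):
    """Assumed schema-based keys: the whole value of one chosen attribute."""
    blocks = defaultdict(set)
    for record_id, record in records.items():
        value = record.get(key_attribute)
        if value:
            blocks[value.lower()].add(record_id)
    return dict(blocks)


def schema_agnostic_blocks(records):
    """Schema-agnostic keys: every token of every attribute value."""
    blocks = defaultdict(set)
    for record_id, record in records.items():
        for value in record.values():
            for token in value.lower().split():
                blocks[token].add(record_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """Distinct record pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


by_name = candidate_pairs(schema_based_blocks(records, "name"))
by_token = candidate_pairs(schema_agnostic_blocks(records))
print(by_name)   # no pair: "john smith" and "jon smith" are different keys
print(by_token)  # records 1 and 2 share the tokens "smith" and "athens"
```

In this toy setting, the name-based key misses the likely duplicate because of a one-letter spelling difference, while the token-based keys still place records 1 and 2 in a common block. The price is a larger number of blocks to process, which mirrors the trade-off between recall and computational cost that the abstract reports.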
Pages: 312 - 323
Page count: 12
Related papers
50 in total
  • [31] Automated database and schema-based data interchange for modeling and simulation
    Harrison, GA
    Maynard, DS
    Pollak, E
    PROCEEDINGS OF THE 2004 WINTER SIMULATION CONFERENCE, VOLS 1 AND 2, 2004, : 191 - 197
  • [32] Xebu: A binary format with schema-based optimizations for XML data
    Kangasharju, J
    Tarkoma, S
    Lindholm, T
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2005, 2005, 3806 : 528 - 535
  • [33] LEARNING PROLOG IN A SCHEMA-BASED ENVIRONMENT
    GEGGHARRISON, TS
    INSTRUCTIONAL SCIENCE, 1991, 20 (2-3) : 173 - 192
  • [34] The Schema-Based Listening Teaching of English
    韦妙
    海外英语, 2011, (08) : 33 - 36
  • [35] Schema-Based Query Rewriting in SPARQL
    Jiang, Lili
    Luo, Jie
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, KSEM 2016, 2016, 9983 : 275 - 285
  • [36] Linking OpenStreetMap with knowledge graphs - Link discovery for schema-agnostic volunteered geographic information
    Tempelmeier, Nicolas
    Demidova, Elena
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2021, 116 : 349 - 364
  • [37] A survey of schema-based matching approaches
    Shvaiko, P
    Euzenat, J
    JOURNAL ON DATA SEMANTICS IV, 2005, 3730 : 146 - 171
  • [38] A schema-based XML index structure
    College of Computer Science, Chongqing University, Chongqing 400044, China
    Jisuanji Gongcheng, 2006, 18 (64-66):
  • [39] Schema-Based JSON Data Stores in Relational Databases
    Irshad, Lubna
    Yan, Li
    Ma, Zongmin
    JOURNAL OF DATABASE MANAGEMENT, 2019, 30 (03) : 38 - 70
  • [40] Schema-Based Visual Queries over Linked Data Endpoints
    Cerans, Karlis
    Lace, Lelde
    Romane, Aiga
    Ovcinnikova, Julija
    Grasmanis, Mikus
    Sprogis, Arturs
    Sostaks, Agris
    METADATA AND SEMANTIC RESEARCH, MTSR 2019, 2019, 1057 : 200 - 206