Schema-agnostic vs Schema-based Configurations for Blocking Methods on Homogeneous Data

Authors
Papadakis, George [1, 3]
Alexiou, George [2 ]
Papastefanatos, George [2 ]
Koutrika, Georgia
Affiliations
[1] HP Labs, Palo Alto, CA USA
[2] Res Ctr Athena, IMIS, Athens, Greece
[3] Univ Athens, Dept Informat & Telecommun, GR-10679 Athens, Greece
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2015 / Vol. 9 / Issue 04
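DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract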
Entity Resolution constitutes a core task for data integration that, due to its quadratic complexity, typically scales to large datasets through blocking methods. These can be configured in two ways. The schema-based configuration relies on schema information in order to select signatures of high distinctiveness and low noise, while the schema-agnostic one treats every token from all attribute values as a signature. The latter approach has significant potential, as it requires no fine-tuning by human experts and it applies to heterogeneous data. Yet, there is no systematic study on its relative performance with respect to the schema-based configuration. This work covers this gap by comparing analytically the two configurations in terms of effectiveness, time efficiency and scalability. We apply them to 9 established blocking methods and to 11 benchmarks of structured data. We provide valuable insights into the internal functionality of the blocking methods with the help of a novel taxonomy. Our studies reveal that the schema-agnostic configuration offers unsupervised and robust definition of blocking keys under versatile settings, trading a higher computational cost for a consistently higher recall than the schema-based one. It also enables the use of state-of-the-art blocking methods without schema knowledge.
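The contrast between the two configurations described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy records, the attribute names, the choice of `city` as the schema-based key, and all function names are assumptions made for illustration only. The schema-based variant uses the whole value of one hand-picked attribute as the blocking key, while the schema-agnostic variant treats every token of every attribute value as a key.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; attribute names and values are invented for illustration.
records = {
    1: {"name": "John Smith", "city": "Athens"},
    2: {"name": "Jon Smith", "city": "Athens"},
    3: {"name": "Mary Jones", "city": "Palo Alto"},
    4: {"name": "Smith John", "city": "Palo Alto"},
}


def _drop_singletons(blocks):
    """Keep only blocks that yield at least one comparison."""
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def schema_based_blocks(records, attribute):
    """Schema-based sketch: the key is the full value of one chosen attribute."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        key = rec.get(attribute, "").strip().lower()
        if key:
            blocks[key].add(rid)
    return _drop_singletons(blocks)


def schema_agnostic_blocks(records):
    """Schema-agnostic sketch: every token of every attribute value is a key."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for value in rec.values():
            for token in value.lower().split():
                blocks[token].add(rid)
    return _drop_singletons(blocks)


def candidate_pairs(blocks):
    """Distinct record pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


based = candidate_pairs(schema_based_blocks(records, "city"))
agnostic = candidate_pairs(schema_agnostic_blocks(records))
print(len(based), sorted(based))
print(len(agnostic), sorted(agnostic))
```

On this toy input, the schema-based key yields two candidate pairs and misses the pair (1, 4), whose names differ only in token order. The schema-agnostic keys yield four candidate pairs, including (1, 4). This mirrors, on a very small scale, the trade-off the abstract reports: more comparisons in exchange for higher recall.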
Pages: 312 - 323
Page count: 12