Query-Driven Sampling for Collective Entity Resolution

Cited by: 2
Authors
Grant, Christan [1]
Wang, Daisy Zhe [2]
Wick, Michael [3]
Affiliations
[1] Univ Oklahoma, Norman, OK 73019 USA
[2] Univ Florida, Gainesville, FL 32611 USA
[3] Univ Massachusetts, Amherst, MA 01003 USA
DOI
10.1109/IRI.2016.34
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Entity resolution (ER) is the process of determining which records (mentions) in a database correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy because they make localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. For large datasets, however, this is an extremely expensive process. A key observation is that such an exhaustive ER process incurs a large up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go entity resolution by developing a number of query-driven collective ER techniques. We introduce two classes of SQL queries that involve ER operators: selection-driven ER and join-driven ER. We implement novel variations of the Metropolis-Hastings MCMC algorithm to generate biased samples, together with selectivity-based scheduling algorithms, to support the two classes of ER queries. Finally, we show that query-driven ER algorithms can converge and return results within minutes over a database populated with extractions from a newswire dataset containing 71 million mentions.
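As a rough illustration of the query-driven sampling idea described in the abstract, the Python sketch below runs a Metropolis-Hastings sampler over cluster assignments and picks the mention to move with a probability biased toward the query. It is not the authors' implementation: the string-similarity affinity, the 0.6 threshold, the `bias` and `sharpness` parameters, and all function names are assumptions made for this example. Because the mention is drawn from a fixed, state-independent distribution and the new label is drawn uniformly, the proposal is symmetric and the acceptance ratio reduces to the ratio of model scores.

```python
import math
import random
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy string similarity in [0, 1] (an assumption, not the paper's model)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def query_driven_er(mentions, query, steps=5000, bias=20.0,
                    sharpness=10.0, seed=0):
    """Sample a clustering of `mentions`, spending most moves near `query`.

    Returns a list of cluster labels, one per mention.
    All parameters other than `mentions` and `query` are hypothetical knobs.
    """
    rng = random.Random(seed)
    n = len(mentions)
    labels = list(range(n))  # start with every mention in its own cluster

    # Pairwise affinity: positive for similar strings, negative otherwise.
    # The 0.6 threshold is an arbitrary choice for this sketch.
    affinity = [[sharpness * (similarity(a, b) - 0.6) for b in mentions]
                for a in mentions]

    # Query-biased selection weights. Every weight stays positive, so each
    # mention can still be proposed and the chain can reach any labeling.
    weights = [1.0 + bias * similarity(m, query) for m in mentions]

    def local_score(i, label):
        # Sum of affinities between mention i and the members of `label`.
        return sum(affinity[i][j] for j in range(n)
                   if j != i and labels[j] == label)

    for _ in range(steps):
        i = rng.choices(range(n), weights=weights)[0]  # biased mention pick
        new = rng.randrange(n)                         # uniform target label
        old = labels[i]
        if new == old:
            continue
        # Change in log-score if mention i moves from `old` to `new`.
        delta = local_score(i, new) - local_score(i, old)
        # Metropolis acceptance for a symmetric proposal.
        if delta >= 0 or rng.random() < math.exp(delta):
            labels[i] = new
    return labels
```

A call such as `query_driven_er(["Barack Obama", "B. Obama", "Michael Jordan", "M. Jordan"], query="Obama")` returns one sampled labeling; mentions that share a label are treated as the same entity in that sample.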
Pages: 208-217
Number of pages: 10