A Multi-Pass Blocking Based Pay-as-you-go Entity Resolution Approach

Authors
Sun C.-C. [1]
Shen D.-R. [1]
Kou Y. [1]
Nie T.-Z. [1]
Yu G. [1]
Affiliation
[1] School of Computer Science and Engineering, Northeastern University, Shenyang
Funding
National Natural Science Foundation of China
Keywords
Candidate pair selection; Data cleaning; Data integration; Entity resolution; Multi-pass blocking; Pay-as-you-go
DOI
10.11897/SP.J.1016.2019.01704
Abstract
Entity resolution (ER) is a key aspect of data integration and data cleaning, and a necessary pre-processing step for data analytics and data mining. Traditional ER approaches take a whole dirty dataset as input and output the complete ER result after a batch process. However, many emerging applications demand (nearly) real-time data analytics, and traditional batch-based ER approaches cannot satisfy such requirements. For instance, a financial news feed tries to resolve as many companies and persons as possible within a limited time, from financial data that is generated frequently. To fulfill such requirements, Pay-as-you-go ER tries to maximize the number of duplicate data objects resolved within a limited time (far shorter than the overall running time). Pay-as-you-go ER is also called progressive ER, since it resolves data objects progressively. The more data objects an ER approach resolves within the limited time, the higher its progressiveness. The core challenge of Pay-as-you-go ER is to select the most matchable object pairs effectively and compare them with high priority. Existing Pay-as-you-go ER solutions rely on perfect blocking keys or sorting keys for pair selection. However, the best blocking/sorting keys cannot be obtained without deep domain knowledge and a full understanding of each dataset, which is out of reach for common users. Worse still, perfect blocking/sorting keys do not always exist. We try to work out an effective Pay-as-you-go ER solution that does not need perfect blocking/sorting keys. We resolve data objects progressively with multi-pass blocking. Multi-pass blocking results in blocking redundancy, which is helpful for computing the match probabilities of candidate object pairs. Intuitively, the more blocks a pair shares, the more matchable the pair is. Yet different blocks usually contribute differently to the match probability of a pair, so it is necessary to evaluate the redundancy of each block. Meanwhile, blocking redundancy has to be eliminated efficiently before pair comparisons.

We propose a blocking-based Pay-as-you-go ER (BPER) approach. BPER uses multi-pass blocking instead of techniques based on perfect blocking/sorting keys. The redundancy of each block is evaluated dynamically, and the evaluation result is called the block credit. Block credits are used to compute pair match probabilities in real time. In addition, an efficient graph-based method is proposed to eliminate blocking redundancy. BPER consists of two stages: the initialization stage and the iterative stage. The initialization stage generates candidate data object pairs and sorts them by match probability in a candidate queue. The iterative stage repeatedly takes the front candidate pair (the most matchable pair) of the candidate queue for processing, dynamically updates the candidate pairs' match probabilities according to the real-time ER result, and then updates the candidate queue. Candidate pair resolution and block credit computation proceed interactively and reinforce each other. As a result, the most matchable pairs are selected and resolved in real time. In this way, the proposed ER approach reduces useless data object comparisons and optimizes the real-time ER result. Finally, we experimentally evaluate the proposed Pay-as-you-go ER approach on real and synthetic datasets. The experimental results show that BPER greatly improves on existing work. We also evaluate the contribution of each component of BPER to progressiveness. © 2019, Science Press. All rights reserved.
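
The two-stage loop described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not give the block credit formula, the match probability formula, or the graph-based redundancy elimination, so the sketch substitutes simple stand-ins and labels them as assumptions. Block credit is taken here to be a smoothed match rate of the comparisons a block has produced so far, a pair's priority is the sum of the credits of the blocks it shares, and redundancy is removed by indexing each pair once in a dictionary. The names `build_blocks`, `progressive_er`, `key_funcs`, `is_match`, and `budget` are hypothetical, and record ids are assumed to be sortable.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key_funcs):
    """One blocking pass per key function: (pass_no, key) -> set of record ids."""
    blocks = defaultdict(set)
    for pass_no, key_func in enumerate(key_funcs):
        for rid, rec in records.items():
            blocks[(pass_no, key_func(rec))].add(rid)
    # A block holding a single record yields no candidate pair.
    return {blk: ids for blk, ids in blocks.items() if len(ids) > 1}


def progressive_er(records, key_funcs, is_match, budget):
    """Compare at most `budget` pairs, most promising first.

    Returns [(pair, matched), ...] in comparison order.
    """
    blocks = build_blocks(records, key_funcs)

    # Initialization stage: index each pair once, however many blocks share it.
    pair_blocks = defaultdict(list)  # pair -> blocks containing both records
    block_pairs = defaultdict(list)  # block -> pairs it generates
    for blk, ids in blocks.items():
        for pair in combinations(sorted(ids), 2):
            pair_blocks[pair].append(blk)
            block_pairs[blk].append(pair)

    stats = {blk: [0, 0] for blk in blocks}  # block -> [matches, comparisons]

    def credit(blk):
        # ASSUMPTION: smoothed match rate; 0.5 while the block has no evidence.
        matches, comparisons = stats[blk]
        return (matches + 1) / (comparisons + 2)

    def score(pair):
        # ASSUMPTION: priority is the summed credit of the blocks a pair shares.
        return sum(credit(blk) for blk in pair_blocks[pair])

    queue = [(-score(p), p) for p in pair_blocks]  # max-heap via negated score
    heapq.heapify(queue)

    done, results = set(), []
    # Iterative stage: resolve the front pair, then refresh affected priorities.
    while queue and len(results) < budget:
        neg_score, pair = heapq.heappop(queue)
        if pair in done or -neg_score != score(pair):
            continue  # already compared, or superseded by a fresher entry
        left, right = pair
        matched = is_match(records[left], records[right])
        done.add(pair)
        results.append((pair, matched))

        # Feed the outcome back into the credits of the shared blocks ...
        for blk in pair_blocks[pair]:
            stats[blk][0] += int(matched)
            stats[blk][1] += 1
        # ... and re-queue the still-pending pairs of those blocks.
        for blk in pair_blocks[pair]:
            for other in block_pairs[blk]:
                if other not in done:
                    heapq.heappush(queue, (-score(other), other))
    return results


# Hypothetical toy data: two blocking passes (name prefix, city).
records = {
    1: {"name": "acme corp", "city": "boston"},
    2: {"name": "acme corporation", "city": "boston"},
    3: {"name": "apex ltd", "city": "boston"},
    4: {"name": "zeta inc", "city": "austin"},
}
key_funcs = [lambda r: r["name"][:2], lambda r: r["city"]]


def same_first_word(a, b):
    return a["name"].split()[0] == b["name"].split()[0]


print(progressive_er(records, key_funcs, same_first_word, budget=2))
```

In the toy data, records 1 and 2 share two blocks and are therefore compared first, which mirrors the intuition that a pair sharing more blocks is more matchable. Stale queue entries are discarded lazily when popped, so every pending pair always has one entry carrying its current priority.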
Pages: 1704-1720
Page count: 16