Multi-attribute Data Indexing for Query Based Entity Resolution

Authors
Sun C.-C. [1]
Shen D.-R. [2]
Xiao Y.-Y. [1]
Li Y.-K. [1]
Affiliations
[1] Key Laboratory of Computer Vision and System of Ministry of Education (Tianjin University of Technology), Tianjin
[2] School of Computer Science and Engineering, Northeastern University, Shenyang
Source
Ruan Jian Xue Bao/Journal of Software | 2022 / Vol. 33 / No. 06
Keywords
Data integration; Data preprocessing; Entity resolution; Multi-attribute data indexing; Query based
DOI
10.13328/j.cnki.jos.006284
Abstract
Entity resolution is a key aspect of data integration and a necessary preprocessing step for big data analytics and mining. In the big data era, query-driven data analytics applications are increasingly common, and query-based entity resolution has become a hot topic. This work studies multi-attribute data indexing for the entity cache in order to improve query-resolution efficiency. There are two core problems. The first is how to design the multi-attribute index. An R-tree based multi-attribute index is designed. Because the entity cache is produced online, an online index construction method based on spatial clustering is proposed. A filter-verify multi-dimensional query method is also proposed: it filters out records that cannot match using the multi-attribute index, and then verifies each candidate record with similarity functions or distance functions. The second is how to insert different string attributes into the tree index. The basic solution is to map strings into integer spaces. For Jaccard similarity and edit similarity, a q-gram based mapping method is proposed and then improved by vector dimension reduction and z-order, which achieves high mapping quality. Finally, the proposed hybrid index is experimentally evaluated on two datasets. The results validate its effectiveness, and different aspects of the multi-attribute index are also tested. © Copyright 2022, Institute of Software, the Chinese Academy of Sciences. All rights reserved.
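The abstract names the ingredients of the string-to-integer mapping (q-grams, vector dimension reduction, z-order) but does not give the construction itself. The following is only a generic sketch of how those pieces can fit together, not the authors' method: the bucket hashing used for dimension reduction, the padding characters, and all function names and parameters (`qgrams`, `reduced_vector`, `z_order`, `dims`, `bits`) are assumptions made for illustration.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of s, padded so short strings still yield grams."""
    # '#' and '$' are assumed padding symbols, not taken from the paper.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b (the 'verify' step)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def reduced_vector(s, dims=2, q=2, bits=4):
    """Map a string to a short integer vector (assumed reduction scheme).

    Each q-gram is hashed into one of `dims` buckets and counted; counts are
    capped so every coordinate fits in `bits` bits.
    """
    cap = (1 << bits) - 1
    vec = [0] * dims
    for g in qgrams(s, q):
        # Sum of code points is a simple deterministic hash
        # (the built-in hash() is salted per process for strings).
        vec[sum(map(ord, g)) % dims] += 1
    return [min(v, cap) for v in vec]


def z_order(coords, bits=4):
    """Interleave the low `bits` bits of each coordinate into one Morton code."""
    code = 0
    for bit in range(bits):
        for d, c in enumerate(coords):
            code |= ((c >> bit) & 1) << (bit * len(coords) + d)
    return code


def string_key(s, dims=2, q=2, bits=4):
    """Single integer key for a string attribute, usable as one index dimension."""
    return z_order(reduced_vector(s, dims, q, bits), bits)
```

In a filter-verify query, keys such as `string_key("smith")` would serve as coordinates in the tree index for coarse filtering, and `jaccard` (or an edit-similarity function) would then be applied to each surviving candidate. Any hashing-based reduction loses information, so the filter bounds needed to avoid false dismissals depend on the actual mapping and are not covered by this sketch.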
Pages: 2331-2347
Page count: 16