Human-in-the-Loop Data Integration System

Authors
Sun J. [1]
Li G.-L. [1]
Affiliation
[1] Department of Computer Science, Tsinghua University, Beijing
Funding
National Natural Science Foundation of China
Keywords
Cost optimization; Data integration; Entity consolidation; Entity matching; Human-in-the-loop; Machine learning; Similarity queries
DOI
10.11897/SP.J.1016.2022.00654
Abstract
An end-to-end data integration system requires human feedback in several phases: collecting training data for entity matching, debugging the resulting clusters, confirming transformations applied to these clusters for data standardization, and finally reducing each cluster to a single, canonical representation (or "golden record"). The conventional approach is to apply the human feedback sequentially, obtained by asking specific questions within some budget in each phase. However, these questions are highly correlated; the answer to one can influence the outcome of any phase of the pipeline. Hence, interleaving them has the potential to offer significant benefits. In this paper, we propose a human-in-the-loop framework that interleaves different types of questions to optimize human involvement. We propose benefit models to measure the quality improvement from asking a question, and cost models to measure the human time it takes to answer a question. We develop a question scheduling framework that judiciously selects questions to maximize the accuracy of the final golden records. Experimental results on three real-world datasets show that our holistic method significantly improves the quality of golden records from 70% to 90%, compared with the state-of-the-art approaches. © 2022, Science Press. All rights reserved.
Pages: 654-668
Number of pages: 14
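Illustrative sketch
The abstract describes scoring each question with a benefit model and a cost model, then scheduling questions of different types together under a budget. The record does not give the paper's actual models or scheduling algorithm, so the following is only a minimal sketch of one plausible reading: a greedy pass that ranks questions by estimated benefit per unit of human time. The `Question` class, the `schedule` function, the question kinds, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """A candidate question for the human (hypothetical structure)."""
    kind: str       # e.g. "training", "cluster", "transformation", "golden_record"
    benefit: float  # assumed estimate of quality improvement if answered
    cost: float     # assumed estimate of human time needed, in seconds


def schedule(questions, budget):
    """Greedily pick questions by benefit/cost ratio within a time budget.

    This is a generic greedy heuristic, not the paper's scheduler: it treats
    the estimates as fixed, whereas the abstract notes that answers to one
    question can change the value of the others.
    """
    if any(q.cost <= 0 for q in questions):
        raise ValueError("every question needs a positive cost")
    chosen = []
    remaining = budget
    # Mix all question types in one ranking instead of one phase at a time.
    for q in sorted(questions, key=lambda q: q.benefit / q.cost, reverse=True):
        if q.cost <= remaining:
            chosen.append(q)
            remaining -= q.cost
    return chosen


if __name__ == "__main__":
    candidates = [
        Question("training", 0.10, 5),
        Question("cluster", 0.30, 20),
        Question("transformation", 0.08, 2),
        Question("golden_record", 0.05, 10),
    ]
    for q in schedule(candidates, budget=25):
        print(q.kind, q.benefit, q.cost)
```

A scheduler closer to the abstract's description would presumably re-estimate the benefit of the remaining questions after each answer, since the questions are correlated; the static ranking above leaves that step out.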