Robust and Distributed Web-Scale Near-Dup Document Conflation in Microsoft Academic Service

Citations: 0

Authors:

Wu, Chieh-Han [1]

Song, Yang [1]

Affiliations:

[1] Microsoft Res, One Microsoft Way, Redmond, WA 98052 USA

Source:

PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA | 2015

Keywords:

Near-duplicate detection; shingling algorithm; n-gram; entity conflation

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Subject classification code:

0812

Abstract:

In modern web-scale applications that collect data from different sources, entity conflation is a challenging task due to various data quality issues. In this paper, we propose a robust and distributed framework to perform conflation on noisy data in the Microsoft Academic Service dataset. Our framework contains two major components. In the offline component, we train a GBDT model to determine whether two papers from different sources should be conflated to the same paper entity. In the online component, we propose a scalable shingling algorithm that can apply our offline model to over 100 million instances. The results show that our algorithm can conflate noisy data robustly and efficiently.
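
The abstract names a shingling algorithm over n-grams but does not give its details. As a rough illustration only, the sketch below shows the generic technique of word-level n-gram shingling with Jaccard similarity, which is commonly used to shortlist near-duplicate candidates before a pairwise classifier is applied. It is not the authors' algorithm: the function names (`shingles`, `jaccard`, `is_candidate_pair`), the tokenization rule, the shingle size `n=3`, and the threshold `0.5` are all assumptions made for this example.

```python
import re


def shingles(text, n=3):
    """Return the set of word-level n-grams (shingles) of `text`.

    Tokenization (lowercase alphanumeric runs) is an assumption for this sketch.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < n:
        # Text shorter than n tokens yields one short shingle, or none if empty.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_candidate_pair(title_a, title_b, n=3, threshold=0.5):
    """Hypothetical filter: keep a pair only if its shingle overlap is high enough."""
    return jaccard(shingles(title_a, n), shingles(title_b, n)) >= threshold
```

In a pipeline of the kind the abstract outlines, a cheap filter like this would plausibly limit how many pairs need to be scored by the trained model, since comparing all pairs among more than 100 million records directly is impractical. How the paper actually distributes and applies the step is not stated in this record.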


Pages: 2606-2611

Page count: 6