Robust and Distributed Web-Scale Near-Dup Document Conflation in Microsoft Academic Service

Cited by: 0
Authors
Wu, Chieh-Han [1]
Song, Yang [1]
Affiliation
[1] Microsoft Res, One Microsoft Way, Redmond, WA 98052 USA
Source
PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA | 2015
Keywords
Near-duplicate detection; shingling algorithm; n-gram; entity conflation
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In modern web-scale applications that collect data from different sources, entity conflation is a challenging task because of data quality issues. In this paper, we propose a robust and distributed framework to perform conflation on noisy data in the Microsoft Academic Service dataset. Our framework contains two major components. In the offline component, we train a gradient boosted decision tree (GBDT) model to determine whether two papers from different sources should be conflated into the same paper entity. In the online component, we propose a scalable shingling algorithm that can apply our offline model to over 100 million instances. The results show that our algorithm can conflate noisy data robustly and efficiently.
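The abstract does not spell out the shingling algorithm, so the following is only a minimal sketch of the general technique it names, not the authors' method. It assumes word-level n-gram shingles over paper titles, an inverted index to avoid comparing every pair of records, and a plain Jaccard threshold standing in for the trained GBDT model. The names `shingles`, `jaccard`, `candidate_pairs`, and the sample records are hypothetical.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a lowercased, punctuation-free string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < n:
        # Short strings yield a single shingle holding all of their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_pairs(records, n=3, threshold=0.5):
    """Return sorted id pairs whose shingle sets have Jaccard similarity >= threshold.

    records maps a record id to its title. Only records that share at least
    one shingle are compared, which avoids an all-pairs scan.
    The threshold is an assumed stand-in for a learned pairwise classifier.
    """
    sets = {rid: shingles(title, n) for rid, title in records.items()}

    # Inverted index: shingle -> ids of the records that contain it.
    index = {}
    for rid, shingle_set in sets.items():
        for sh in shingle_set:
            index.setdefault(sh, set()).add(rid)

    # Collect each pair of records that co-occurs under some shingle.
    seen = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            seen.add((a, b))

    return sorted(
        (a, b) for a, b in seen if jaccard(sets[a], sets[b]) >= threshold
    )


# Hypothetical sample records: two noisy variants of one title plus an unrelated one.
records = {
    "p1": "Robust and Distributed Web-Scale Near-Dup Document Conflation",
    "p2": "Robust and distributed web scale near dup document conflation.",
    "p3": "A Survey of Graph Neural Networks",
}
pairs = candidate_pairs(records)
```

In a pipeline like the one the abstract describes, the pairs surviving this candidate step would presumably be scored by the offline model rather than by a fixed threshold; the threshold here only keeps the sketch self-contained.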
Pages: 2606-2611
Number of pages: 6