Constructing a Comprehensive Events Database from the Web

Cited by: 9

Authors:

Wang, Qifan [1]

Kanagal, Bhargav [1]

Garg, Vijay [1]

Sivakumar, D. [1]

Affiliations:

[1] Google Res, Mountain View, CA 94043 USA

Source:

PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19) | 2019

Keywords:

structure data; event data extraction; consolidation; wrapper; EXTRACTION;

DOI:

10.1145/3357384.3357986

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

In this paper, we consider the problem of constructing a comprehensive database of events taking place around the world. Events include small hyper-local events such as farmers' markets and neighborhood garage sales, as well as larger concerts and festivals. Designing a high-precision and high-recall event extractor for unstructured pages across the whole web is a challenging problem. We cannot rely heavily on domain-specific strategies, since the extractor needs to work on all web pages, including those from new domains; we need to account for variations in page layout and structure across websites. Further, we need to deal with low-quality pages on the web that have limited structure. We have built an ML-powered extraction system to solve this problem, using schema.org annotations as training data. Our extraction system operates in two phases. In the first phase, we generate raw event information from individual web pages. To do this, an event page classifier predicts whether a web page contains any event information; this is followed by a single/multiple classifier that decides whether the page contains a single event or multiple events; the first phase concludes by applying event extractors that extract the key fields of a public event (the title, the date/time information, and the location information). In the second phase, we further improve the extraction quality via three novel algorithms (repeated patterns, event consolidation, and wrapper induction), which take the raw event extractions as input and generate events of significantly higher quality. We evaluate our extraction models on two large-scale publicly available web corpora, Common Crawl and ClueWeb12. Experimental analysis shows that our methodology achieves over 95% extraction precision and recall on both datasets.
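The control flow of the first phase described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: in the paper the two classifiers and the field extractors are ML models trained on schema.org annotations, whereas the `toy_*` functions, the `RawEvent` type, and the `Event: title | date | location` line format below are all hypothetical stand-ins chosen so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RawEvent:
    """Key fields of a public event, as listed in the abstract."""
    title: Optional[str]
    date_time: Optional[str]
    location: Optional[str]


def extract_raw_events(
    page: str,
    is_event_page: Callable[[str], bool],
    has_multiple_events: Callable[[str], bool],
    extract_single: Callable[[str], RawEvent],
    extract_multiple: Callable[[str], List[RawEvent]],
) -> List[RawEvent]:
    """Phase 1: page classifier, then single/multiple classifier, then extractors."""
    if not is_event_page(page):
        return []  # page carries no event information
    if has_multiple_events(page):
        return extract_multiple(page)
    return [extract_single(page)]


# --- Hypothetical stand-ins for the learned models (assumptions, not from the paper) ---

def _event_lines(page: str) -> List[str]:
    return [line for line in page.splitlines() if line.startswith("Event:")]


def _parse_line(line: str) -> RawEvent:
    # Assumed toy format: "Event: title | date | location"
    parts = [p.strip() or None for p in line.split(":", 1)[1].split("|")]
    parts += [None] * (3 - len(parts))  # pad missing fields
    return RawEvent(*parts[:3])


def toy_is_event_page(page: str) -> bool:
    return len(_event_lines(page)) > 0


def toy_has_multiple_events(page: str) -> bool:
    return len(_event_lines(page)) > 1


def toy_extract_single(page: str) -> RawEvent:
    return _parse_line(_event_lines(page)[0])


def toy_extract_multiple(page: str) -> List[RawEvent]:
    return [_parse_line(line) for line in _event_lines(page)]


if __name__ == "__main__":
    page = "Event: Farmers market | Sat 9am | Main St\nEvent: Jazz night | Fri 8pm | City Hall"
    for event in extract_raw_events(
        page, toy_is_event_page, toy_has_multiple_events,
        toy_extract_single, toy_extract_multiple,
    ):
        print(event)
```

Passing the classifiers and extractors in as callables keeps the routing logic separate from the models, so the second-phase algorithms named in the abstract (repeated patterns, event consolidation, wrapper induction) could consume the returned list without depending on how each raw event was produced.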


Pages: 229-238

Page count: 10

50 items in total

[1] Constructing the Web of Events from raw data in the Web of Things
Sun, Yunchuan
Yan, Hongli
Lu, Cheng
Bie, Rongfang
Zhou, Zhangbing
[J]. MOBILE INFORMATION SYSTEMS, 2014, 10 (01) : 105 - 125
[2] Constructing a Comprehensive National Wildfire Database from Incomplete Sources: Israel as a Case Study
Guk, Edna
Bar-Massada, Avi
Levin, Noam
[J]. FIRE-SWITZERLAND, 2023, 6 (04):
[3] Spider Mites Web: A comprehensive database for the Tetranychidae
Migeon, Alain
Nouguier, Elodie
Dorkeld, Franck
[J]. TRENDS IN ACAROLOGY, 2010, : 557 - 560
[4] The ecoinvent database system: a comprehensive web-based LCA database
Frischknecht, R
Rebitzer, G
[J]. JOURNAL OF CLEANER PRODUCTION, 2005, 13 (13-14) : 1337 - 1343
[5] A COMPREHENSIVE DATABASE OF FLOOD EVENTS IN THE CONTIGUOUS UNITED STATES FROM 2002 TO 2013
Shen, Xinyi
Mei, Yiwen
Anagnostou, Emmanouil N.
[J]. BULLETIN OF THE AMERICAN METEOROLOGICAL SOCIETY, 2017, 98 (07) : 1493 - 1502
[6] ON CONSTRUCTING INSTANTS FROM EVENTS
THOMASON, SK
[J]. JOURNAL OF PHILOSOPHICAL LOGIC, 1984, 13 (01) : 85 - 96
[7] MetaADEDB 2.0: a comprehensive database on adverse drug events
Yu, Zhuohang
Wu, Zengrui
Li, Weihua
Liu, Guixia
Tang, Yun
[J]. BIOINFORMATICS, 2021, 37 (15) : 2221 - 2222
[8] Comprehensive Analysis of Adverse Events Associated With Hypoglossal Nerve Stimulators: Insights From the MAUDE Database
Bentan, Mihai A.
Nord, Ryan
[J]. OTOLARYNGOLOGY-HEAD AND NECK SURGERY, 2024,
[9] CASH: a constructing comprehensive splice site method for detecting alternative splicing events
Wu, Wenwu
Zong, Jie
Wei, Ning
Cheng, Jian
Zhou, Xuexia
Cheng, Yuanming
Chen, Dai
Guo, Qinghua
Zhang, Bo
Feng, Ying
[J]. BRIEFINGS IN BIOINFORMATICS, 2018, 19 (05) : 905 - 917
[10] Constructing a corpus from the web: message boards
Claridge, Claudia
[J]. Corpus Linguistics and the Web, 2007, 59 : 87 - 108
