Constructing a Comprehensive Events Database from the Web

Cited by: 9
Authors
Wang, Qifan [1]
Kanagal, Bhargav [1]
Garg, Vijay [1]
Sivakumar, D. [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
Keywords
structure data; event data extraction; consolidation; wrapper; EXTRACTION;
DOI
10.1145/3357384.3357986
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
In this paper, we consider the problem of constructing a comprehensive database of events taking place around the world. Events include small hyper-local events such as farmers' markets and neighborhood garage sales, as well as larger concerts and festivals. Designing a high-precision, high-recall event extractor for unstructured pages across the whole web is a challenging problem. We cannot rely heavily on domain-specific strategies, since the extractor needs to work on all web pages, including those on new domains, and we need to account for variations in page layout and structure across websites. Further, we need to deal with low-quality pages on the web that have limited structure. We have built an ML-powered extraction system to solve this problem, using schema.org annotations as training data. Our extraction system operates in two phases. In the first phase, we generate raw event information from individual web pages. To do this, an event page classifier predicts whether a web page contains any event information; this is followed by a single/multiple classifier that decides whether the page contains a single event or multiple events; the first phase concludes by applying event extractors that extract the key fields of a public event (the title, the date/time information, and the location information). In the second phase, we further improve the extraction quality via three novel algorithms (repeated patterns, event consolidation, and wrapper induction), which are designed to take the raw event extractions as input and generate events of significantly higher quality. We evaluate our extraction models on two large-scale, publicly available web corpora, Common Crawl and ClueWeb12. Experimental analysis shows that our methodology achieves over 95% extraction precision and recall on both datasets.
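The two-phase control flow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the page representation, the `RawEvent` type, and the `toy_*` functions are hypothetical stand-ins for the learned classifiers and extractors, and `phase_two` only removes duplicates by a normalized key. It does not implement the paper's repeated-pattern, consolidation, or wrapper-induction algorithms, which the abstract names but does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RawEvent:
    """Key fields of a public event, as listed in the abstract."""
    title: str
    date: str
    location: str


# Hypothetical page representation: a URL plus a list of candidate blocks.
Page = Dict[str, object]


def phase_one(
    page: Page,
    is_event_page: Callable[[Page], bool],
    is_single_event: Callable[[Page], bool],
    extract_fields: Callable[[dict], RawEvent],
) -> List[RawEvent]:
    """Phase one: page classifier, single/multiple classifier, field extraction."""
    if not is_event_page(page):
        return []
    blocks = page["blocks"]
    if is_single_event(page):
        blocks = blocks[:1]  # a single-event page yields one extraction
    return [extract_fields(block) for block in blocks]


def phase_two(raw_events: List[RawEvent]) -> List[RawEvent]:
    """Placeholder for phase two: keeps the first event seen per normalized key."""
    seen: Dict[tuple, RawEvent] = {}
    for event in raw_events:
        key = (
            event.title.strip().lower(),
            event.date,
            event.location.strip().lower(),
        )
        seen.setdefault(key, event)
    return list(seen.values())


# Toy stand-ins for the learned models (assumptions, not the paper's models).
def toy_is_event_page(page: Page) -> bool:
    return len(page["blocks"]) > 0


def toy_is_single_event(page: Page) -> bool:
    return len(page["blocks"]) == 1


def toy_extract_fields(block: dict) -> RawEvent:
    return RawEvent(block["title"], block["date"], block["location"])


def run_pipeline(pages: List[Page]) -> List[RawEvent]:
    """Run phase one on every page, then phase two on the pooled raw events."""
    raw: List[RawEvent] = []
    for page in pages:
        raw.extend(
            phase_one(page, toy_is_event_page, toy_is_single_event, toy_extract_fields)
        )
    return phase_two(raw)
```

In this sketch the classifiers are passed in as plain callables, so a trained model could replace a toy function without changing the control flow.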
Pages: 229-238
Number of pages: 10
Related papers
50 in total
  • [1] Constructing the Web of Events from raw data in the Web of Things
    Sun, Yunchuan
    Yan, Hongli
    Lu, Cheng
    Bie, Rongfang
    Zhou, Zhangbing
    [J]. MOBILE INFORMATION SYSTEMS, 2014, 10 (01) : 105 - 125
  • [2] Constructing a Comprehensive National Wildfire Database from Incomplete Sources: Israel as a Case Study
    Guk, Edna
    Bar-Massada, Avi
    Levin, Noam
    [J]. FIRE-SWITZERLAND, 2023, 6 (04):
  • [3] Spider Mites Web: A comprehensive database for the Tetranychidae
    Migeon, Alain
    Nouguier, Elodie
    Dorkeld, Franck
    [J]. TRENDS IN ACAROLOGY, 2010, : 557 - 560
  • [4] The ecoinvent database system: a comprehensive web-based LCA database
    Frischknecht, R
    Rebitzer, G
    [J]. JOURNAL OF CLEANER PRODUCTION, 2005, 13 (13-14) : 1337 - 1343
  • [5] A COMPREHENSIVE DATABASE OF FLOOD EVENTS IN THE CONTIGUOUS UNITED STATES FROM 2002 TO 2013
    Shen, Xinyi
    Mei, Yiwen
    Anagnostou, Emmanouil N.
    [J]. BULLETIN OF THE AMERICAN METEOROLOGICAL SOCIETY, 2017, 98 (07) : 1493 - 1502
  • [6] ON CONSTRUCTING INSTANTS FROM EVENTS
    THOMASON, SK
    [J]. JOURNAL OF PHILOSOPHICAL LOGIC, 1984, 13 (01) : 85 - 96
  • [7] MetaADEDB 2.0: a comprehensive database on adverse drug events
    Yu, Zhuohang
    Wu, Zengrui
    Li, Weihua
    Liu, Guixia
    Tang, Yun
    [J]. BIOINFORMATICS, 2021, 37 (15) : 2221 - 2222
  • [8] Comprehensive Analysis of Adverse Events Associated With Hypoglossal Nerve Stimulators: Insights From the MAUDE Database
    Bentan, Mihai A.
    Nord, Ryan
    [J]. OTOLARYNGOLOGY-HEAD AND NECK SURGERY, 2024,
  • [9] CASH: a constructing comprehensive splice site method for detecting alternative splicing events
    Wu, Wenwu
    Zong, Jie
    Wei, Ning
    Cheng, Jian
    Zhou, Xuexia
    Cheng, Yuanming
    Chen, Dai
    Guo, Qinghua
    Zhang, Bo
    Feng, Ying
    [J]. BRIEFINGS IN BIOINFORMATICS, 2018, 19 (05) : 905 - 917
  • [10] Constructing a corpus from the web: message boards
    Claridge, Claudia
    [J]. Corpus Linguistics and the Web, 2007, 59 : 87 - 108