Learning to Discover Domain-Specific Web Content

Cited by: 4
Authors
Pham, Kien [1]
Santos, Aecio [1]
Freire, Juliana [1]
Affiliation
[1] NYU, New York, NY 10003 USA
DOI
10.1145/3159652.3159724
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of the essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield for new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing the coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we have simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage compared to existing, state-of-the-art techniques. It is also able to capture 80% of new relevant content within less than 4 hours of publication.
Pages: 432-440
Number of pages: 9