Learning to Discover Domain-Specific Web Content

Cited by: 4
Authors
Pham, Kien [1]
Santos, Aecio [1]
Freire, Juliana [1]
Affiliations
[1] NYU, New York, NY 10003 USA
DOI
10.1145/3159652.3159724
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of the essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield for new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing the coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we have simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage than existing, state-of-the-art techniques. In addition, it is able to capture 80% of new relevant content within 4 hours of publication.
Pages: 432-440
Number of pages: 9