Learning to Discover Domain-Specific Web Content

Cited by: 4
Authors
Pham, Kien [1]
Santos, Aecio [1]
Freire, Juliana [1]
Affiliation
[1] NYU, New York, NY 10003 USA
Keywords
DOI
10.1145/3159652.3159724
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of the essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield for new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing the coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we have simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage compared to existing, state-of-the-art techniques. In addition, it is also able to capture 80% of new relevant content in less than 4 hours after publication.
Pages: 432-440
Number of pages: 9
Related papers
50 records in total
  • [31] Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale
    Rheinlaender, Astrid
    Lehmann, Mario
    Kunkel, Anja
    Meier, Joerg
    Leser, Ulf
    SIGMOD'16: PROCEEDINGS OF THE 2016 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2016, : 759 - 771
  • [32] Extracting Web Business Information Based on Domain-Specific Ontology
    Shen, J.
    Bi, L.
    Xu, F. Y.
    He, K.
    Wei, L. H.
    Zhu, Y.
    ITESS: 2008 PROCEEDINGS OF INFORMATION TECHNOLOGY AND ENVIRONMENTAL SYSTEM SCIENCES, PT 1, 2008, : 997 - 1003
  • [33] Extraction of Query Interfaces for Domain-Specific Hidden Web Crawler
    Gupta, Nupur
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2016, 16 (02): : 124 - 127
  • [34] Building Web navigation agents using domain-specific ontologies
    Yang, JY
    Jung, HS
    Choi, J
    INTELLIGENT AGENTS AND MULTI-AGENT SYSTEMS, 2005, 3371 : 303 - 316
  • [35] Web Site Modeling and Prototyping Based on a Domain-Specific Language
    Stibe, Agnis
    Bicevskis, Janis
    BALTIC JOURNAL OF MODERN COMPUTING, 2009, 751 : 7 - 21
  • [36] SWQL: A new domain-specific language for mining the social Web
    Guzman-Guzman, Xiomarah
    Rolando Nunez-Valdez, Edward
    Vasquez-Reynoso, Raysa
    Asencio, Angel
    Garcia-Diaz, Vicente
    SCIENCE OF COMPUTER PROGRAMMING, 2021, 207
  • [37] LEARNING DOMAIN-SPECIFIC HEURISTICS FOR ANSWER SET SOLVERS
    Balduccini, Marcello
    TECHNICAL COMMUNICATIONS OF THE 26TH INTERNATIONAL CONFERENCE ON LOGIC PROGRAMMING (ICLP'10), 2010, 7 : 14 - 23
  • [38] Lifelong Learning of Topics and Domain-Specific Word Embeddings
    Qin, Xiaorui
    Lu, Yuyin
    Chen, Yufu
    Rao, Yanghui
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 2294 - 2309
  • [39] Domain-specific and domain-general constraints on word and sequence learning
    Archibald, Lisa M. D.
    Joanisse, Marc F.
    MEMORY & COGNITION, 2013, 41 (02) : 268 - 280
  • [40] Learning and using domain-specific heuristics in ASP solvers
    Balduccini, Marcello
    AI COMMUNICATIONS, 2011, 24 (02) : 147 - 164