Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly

Authors
Brunelle, Justin F. [1, 2]
Weigle, Michele C. [2 ]
Nelson, Michael L. [2 ]
Affiliations
[1] Mitre Corp, 7525 Colshire Dr, Mclean, VA 22101 USA
[2] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
Funding
National Science Foundation (USA)
Keywords
Web Archiving; Digital Preservation; Memento; Web Crawling
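DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract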
The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, resulting in pages that are correspondingly difficult to archive. JavaScript enables interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to discover and crawl all of the resources in deferred representations, and the result of archiving deferred representations is archived web pages that are either incomplete or erroneously load embedded resources from the live web. We propose a method of discovering and archiving deferred representations and their descendants (representation states) that are only reachable through client-side events. Our approach identified an average of 38.5 descendants per seed URI crawled, 70.9% of which are reached through an onclick event. This approach also added 15.6 times more embedded resources than Heritrix to the crawl frontier, but at a crawl rate that was 38.9 times slower than simply using Heritrix. If our method were applied to the July 2015 Common Crawl dataset, a web-scale archival crawler would discover an additional 7.17 PB (5.12 times more) of information per year. This illustrates the significant increase in resources necessary for more thorough archival crawls.
Pages: 1-10
Page count: 10
Related papers
3 in total
  • [1] Static Typing & JavaScript Libraries: Towards a More Considerate Relationship
    Canou, Benjamin
    Chailloux, Emmanuel
    Botbol, Vincent
    [J]. PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'13 COMPANION), 2013, : 15 - 17
  • [2] Multiple Classifier Systems for More Accurate JavaScript Malware Detection
    Yi, Zibo
    Ma, Jun
    Luo, Lei
    Yu, Jie
    Wu, Qingbo
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PROMOTION OF INFORMATION TECHNOLOGY (ICPIT 2016), 2016, 66 : 139 - 143
  • [3] Using JavaScript enhanced HTML to create a more interactive and collaborative learning environment.
    Maniscalco, IA
    DiLullo, D
    [J]. ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 1998, 216 : U492 - U492