A large-scale study of the evolution of Web pages

Cited by: 62
Authors
Fetterly, D
Manasse, M
Najork, M
Wiener, JL
Affiliations
[1] Microsoft Res, Mountain View, CA 94043 USA
[2] Hewlett Packard Labs, Mountain View, CA 94043 USA
Source
SOFTWARE-PRACTICE & EXPERIENCE | 2004 / Vol. 34 / No. 02
Keywords
Web characterization; Web evolution; Web pages; rate of change; degree of change
DOI
10.1002/spe.577
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
How fast does the Web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated with any other property of the page? All of these questions are of interest to those who mine the Web, including all the popular search engines, but few studies have been performed to date to answer them. One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720,000 pages on a daily basis over 4 months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all Web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily. This paper expands on Cho and Garcia-Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150,836,209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo-randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages. After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones. This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages. Copyright © 2004 John Wiley & Sons, Ltd.
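The abstract contrasts two ways of detecting change: a checksum, which only says whether a page changed at all, and a word-based feature vector, which can express how much it changed. The following is a minimal sketch of that distinction, not the authors' implementation. The abstract does not specify the feature vector or the sampling method, so the word-shingle sets, the Jaccard-based `degree_of_change` score, and the hash-based `in_sample` selection are all illustrative assumptions, and every function name here is hypothetical.

```python
import hashlib
import re


def checksum(page_text: str) -> str:
    """MD5 hex digest of the page text; any edit at all changes it."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()


def shingles(page_text: str, k: int = 3) -> set:
    """Set of k-word shingles (assumed stand-in for the paper's feature vector)."""
    words = re.findall(r"\w+", page_text.lower())
    if len(words) < k:
        # Pages shorter than k words are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def degree_of_change(old: str, new: str, k: int = 3) -> float:
    """1 - Jaccard similarity of shingle sets: 0.0 = identical, 1.0 = disjoint."""
    a, b = shingles(old, k), shingles(new, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def in_sample(url: str, per_thousand: int = 1) -> bool:
    """Deterministic pseudo-random selection of roughly 0.1% of URLs.

    Assumption: the URL's hash is reduced modulo 1000 and the URL is kept
    when it lands in the lowest bucket(s), so every crawl selects the same URLs.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 1000 < per_thousand


if __name__ == "__main__":
    week1 = "the quick brown fox jumps over the lazy dog"
    week2 = "the quick brown fox leaps over the lazy dog"
    print(checksum(week1) != checksum(week2))  # checksum only reports "changed"
    print(degree_of_change(week1, week2))      # shingles report how much
```

A one-word edit flips the checksum just as a full rewrite would, whereas the shingle score grows with the amount of text that differs; this is the kind of added sensitivity to change the abstract refers to.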
Pages: 213 / U3
Page count: 28