Unsupervised Domain Ranking in Large-Scale Web Crawls

Cited by: 0
Authors
Cui, Yi [1 ]
Sparkman, Clint [2 ]
Lee, Hsin-Tsang [3 ]
Loguinov, Dmitri [1 ]
Affiliations
[1] Texas A&M Univ, Dept Comp Sci & Engn, College Stn, TX 77843 USA
[2] US Air Force Acad, 2354 Fairchild Dr,Suite 6G-101, Colorado Springs, CO 80840 USA
[3] Microsoft Corp, Redmond, WA 98052 USA
Keywords
Web crawling; ranking; frontier prioritization; DESIGN;
DOI
10.1145/3182180
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
With the proliferation of web spam and infinite autogenerated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. In this work, we assume crawls that produce frontiers orders of magnitude larger than RAM, where sorting of pending URLs is infeasible in real time. Under these constraints, the main objective is to quickly compute domain budgets and decide which of them can be massively crawled. Those ranked at the top of the list receive aggressive crawling allowances, while all other domains are visited at some small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.
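The budgeting idea in the abstract (rank domains, give the top of the list a large crawl allowance, and leave everything else at a small default rate) can be illustrated with a minimal sketch. It assumes plain domain-level in-degree (the number of distinct other domains linking in) as the ranking signal, which is a simplification and is not the paper's TSE estimator or its BFS-based methods. All names here (`domain_of`, `domain_in_degree`, `assign_budgets`, `top_k`, `high_budget`, `default_budget`) are hypothetical, and the domain extraction is deliberately naive.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Naive domain extraction (assumption): keep the last two host labels.

    A real crawler would consult a public-suffix list instead.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def domain_in_degree(page_links):
    """Collapse page-level links into a domain-level in-degree count.

    page_links is an iterable of (source_url, target_url) pairs.
    Links within the same domain are ignored, and each supporting
    domain is counted once per target domain.
    """
    supporters = defaultdict(set)
    for src, dst in page_links:
        s, d = domain_of(src), domain_of(dst)
        if s and d and s != d:
            supporters[d].add(s)
    return {d: len(s) for d, s in supporters.items()}


def assign_budgets(in_degree, top_k, high_budget, default_budget):
    """Give the top_k domains by in-degree a large budget, others the default."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(in_degree, key=lambda d: (-in_degree[d], d))
    top = set(ranked[:top_k])
    return {d: (high_budget if d in top else default_budget) for d in in_degree}


links = [
    ("http://a.com/1", "http://b.com/x"),
    ("http://c.org/2", "http://b.com/y"),
    ("http://a.com/3", "http://c.org/"),
    ("http://b.com/", "http://b.com/z"),      # intra-domain link, ignored
    ("http://www.a.com/4", "http://b.com/q"),  # same supporter as a.com
]
degrees = domain_in_degree(links)
budgets = assign_budgets(degrees, top_k=1, high_budget=10000, default_budget=10)
```

Domains that never receive an external link do not appear in `budgets`; a caller would look them up with `budgets.get(domain, default_budget)` so they are still visited at the default rate.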
Pages: 29
Related papers
50 in total
  • [1] Agnostic Topology-Based Spam Avoidance in Large-Scale Web Crawls
    Sparkman, Clint
    Lee, Hsin-Tsang
    Loguinov, Dmitri
    [J]. 2011 PROCEEDINGS IEEE INFOCOM, 2011, : 811 - 819
  • [2] Limiting Large-scale Crawls of Social Networking Sites
    Mondal, Mainack
    Viswanath, Bimal
    Clement, Allen
    Druschel, Peter
    Gummadi, Krishna P.
    Mislove, Alan
    Post, Ansley
    [J]. ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2011, 41 (04) : 398 - 399
  • [3] Towards Large-Scale Unsupervised Relation Extraction from the Web
    Min, Bonan
    Shi, Shuming
    Grishman, Ralph
    Lin, Chin-Yew
    [J]. INTERNATIONAL JOURNAL ON SEMANTIC WEB AND INFORMATION SYSTEMS, 2012, 8 (03) : 1 - 23
  • [4] Large-Scale Unsupervised Semantic Segmentation
    Gao, Shanghua
    Li, Zhong-Yu
    Yang, Ming-Hsuan
    Cheng, Ming-Ming
    Han, Junwei
    Torr, Philip
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7457 - 7476
  • [5] Large-Scale Unsupervised Object Discovery
    Vo, Huy V.
    Sizikova, Elena
    Schmid, Cordelia
    Perez, Patrick
    Ponce, Jean
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [6] Decentralized Ranking in Large-Scale Overlay Networks
    Montresor, Alberto
    Jelasity, Mark
    Babaoglu, Ozalp
    [J]. SASOW 2008: SECOND IEEE INTERNATIONAL CONFERENCE ON SELF-ADAPTIVE AND SELF-ORGANIZING SYSTEMS WORKSHOPS, PROCEEDINGS, 2008, : 208 - +
  • [7] Fast Unsupervised Projection for Large-Scale Data
    Wang, Jingyu
    Wang, Lin
    Nie, Feiping
    Li, Xuelong
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (08) : 3634 - 3644
  • [8] Large-Scale Web Page Classification
    Marath, Sathi T.
    Shepherd, Michael
    Milios, Evangelos
    Duffy, Jack
    [J]. 2014 47TH HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES (HICSS), 2014, : 1813 - 1822
  • [9] Large-Scale Web Data Analysis
    Leskovec, Jure
    [J]. IEEE INTELLIGENT SYSTEMS, 2011, 26 (01) : 11 - 11
  • [10] Linguistics in large-scale Web search
    Gulla, JA
    Auran, PG
    Risvik, KM
    [J]. NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, 2002, 2553 : 218 - 222