Unsupervised Domain Ranking in Large-Scale Web Crawls

Cited by: 0

Authors:

Cui, Yi [1]

Sparkman, Clint [2]

Lee, Hsin-Tsang [3]

Loguinov, Dmitri [1]

Affiliations:

[1] Texas A&M Univ, Dept Comp Sci & Engn, College Stn, TX 77843 USA

[2] US Air Force Acad, 2354 Fairchild Dr,Suite 6G-101, Colorado Springs, CO 80840 USA

[3] Microsoft Corp, Redmond, WA 98052 USA

Source:

ACM TRANSACTIONS ON THE WEB | 2018 / Vol. 12 / Issue 04

Keywords:

Web crawling; ranking; frontier prioritization; DESIGN;

DOI:

10.1145/3182180

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

With the proliferation of web spam and infinite autogenerated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. In this work, we assume crawls that produce frontiers orders of magnitude larger than RAM, where sorting of pending URLs is infeasible in real time. Under these constraints, the main objective is to quickly compute domain budgets and decide which of them can be massively crawled. Those ranked at the top of the list receive aggressive crawling allowances, while all other domains are visited at some small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.
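
The budgeting idea in the abstract (top-ranked domains get a large crawl allowance, all others a small default rate) can be illustrated with a minimal sketch. It is not the paper's TSE estimator or its BFS-based ranking: plain domain-level in-degree is used here only as a simple stand-in, and the function names, the `top_k` cutoff, and the budget values are all hypothetical.

```python
from collections import defaultdict


def rank_domains_by_in_degree(edges):
    """Rank domains by the number of distinct other domains linking to them.

    edges: iterable of (src_domain, dst_domain) pairs (hypothetical input format).
    Plain in-degree is an illustrative stand-in, not the method from the paper.
    """
    in_links = defaultdict(set)
    domains = set()
    for src, dst in edges:
        domains.update((src, dst))
        if src != dst:  # ignore self-links within a domain
            in_links[dst].add(src)
    # Highest in-degree first; ties broken alphabetically for determinism.
    return sorted(domains, key=lambda d: (-len(in_links[d]), d))


def assign_budgets(ranked, top_k, high_budget, default_budget):
    """Give the top_k ranked domains a large budget and all others a default."""
    return {
        domain: (high_budget if position < top_k else default_budget)
        for position, domain in enumerate(ranked)
    }


edges = [
    ("a.com", "b.com"),
    ("c.com", "b.com"),
    ("a.com", "c.com"),
    ("b.com", "b.com"),
]
ranked = rank_domains_by_in_degree(edges)
budgets = assign_budgets(ranked, top_k=1, high_budget=10000, default_budget=10)
```

In this toy graph, `b.com` has two distinct in-linking domains and ends up with the large allowance, while `c.com` and `a.com` fall back to the default rate.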


Pages: 29

50 items in total

[1] Agnostic Topology-Based Spam Avoidance in Large-Scale Web Crawls
Sparkman, Clint
Lee, Hsin-Tsang
Loguinov, Dmitri
[J]. 2011 PROCEEDINGS IEEE INFOCOM, 2011, : 811 - 819
[2] Limiting Large-scale Crawls of Social Networking Sites
Mondal, Mainack
Viswanath, Bimal
Clement, Allen
Druschel, Peter
Gummadi, Krishna P.
Mislove, Alan
Post, Ansley
[J]. ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2011, 41 (04) : 398 - 399
[3] Towards Large-Scale Unsupervised Relation Extraction from the Web
Min, Bonan
Shi, Shuming
Grishman, Ralph
Lin, Chin-Yew
[J]. INTERNATIONAL JOURNAL ON SEMANTIC WEB AND INFORMATION SYSTEMS, 2012, 8 (03) : 1 - 23
[4] Large-Scale Unsupervised Semantic Segmentation
Gao, Shanghua
Li, Zhong-Yu
Yang, Ming-Hsuan
Cheng, Ming-Ming
Han, Junwei
Torr, Philip
[J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7457 - 7476
[5] Large-Scale Unsupervised Object Discovery
Vo, Huy V.
Sizikova, Elena
Schmid, Cordelia
Perez, Patrick
Ponce, Jean
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[6] Decentralized Ranking in Large-Scale Overlay Networks
Montresor, Alberto
Jelasity, Mark
Babaoglu, Ozalp
[J]. SASOW 2008: SECOND IEEE INTERNATIONAL CONFERENCE ON SELF-ADAPTIVE AND SELF-ORGANIZING SYSTEMS WORKSHOPS, PROCEEDINGS, 2008, : 208 - +
[7] Fast Unsupervised Projection for Large-Scale Data
Wang, Jingyu
Wang, Lin
Nie, Feiping
Li, Xuelong
[J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (08) : 3634 - 3644
[8] Large-Scale Web Page Classification
Marath, Sathi T.
Shepherd, Michael
Milios, Evangelos
Duffy, Jack
[J]. 2014 47TH HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES (HICSS), 2014, : 1813 - 1822
[9] Large-Scale Web Data Analysis
Leskovec, Jure
[J]. IEEE INTELLIGENT SYSTEMS, 2011, 26 (01) : 11 - 11
[10] Linguistics in large-scale Web search
Gulla, JA
Auran, PG
Risvik, KM
[J]. NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, 2002, 2553 : 218 - 222
