UniCrawl: A Practical Geographically Distributed Web Crawler

Cited by: 8
Authors
Le Quoc, Do [1]
Fetzer, Christof [1]
Felber, Pascal [2]
Riviere, Etienne [2]
Schiavoni, Valerio [2]
Sutra, Pierre [2]
Affiliations
[1] Tech Univ Dresden, Syst Engn Grp, Dresden, Germany
[2] Univ Neuchatel, Neuchatel, Switzerland
Keywords
web crawler; geo-distributed system; cloud federation; storage; map-reduce
DOI
10.1109/CLOUD.2015.59
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As the wealth of information available on the web keeps growing, being able to harvest massive amounts of data has become a major challenge. Web crawlers are the core components to retrieve such vast collections of publicly available data. The key limiting factor of any crawler architecture is, however, its large infrastructure cost. To reduce this cost, and in particular the high upfront investments, we present in this paper a geo-distributed crawler solution, UniCrawl. UniCrawl orchestrates several geographically distributed sites. Each site operates an independent crawler and relies on well-established techniques for fetching and parsing the content of the web. UniCrawl splits the crawled domain space across the sites and federates their storage and computing resources, while minimizing the inter-site communication cost. To assess our design choices, we evaluate UniCrawl in a controlled environment using the ClueWeb12 dataset, and in the wild when deployed over several remote locations. We conducted several experiments over 3 sites spread across Germany. When compared to a centralized architecture with a crawler simply stretched over several locations, UniCrawl shows a performance improvement of 93.6% in terms of network bandwidth consumption, and a speedup factor of 1.75.
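The abstract states that UniCrawl splits the crawled domain space across sites while keeping inter-site traffic low, but it does not say how the split is computed. The following is a minimal sketch of one common way to do this, assuming hash-based partitioning on the host name; the site names, `site_for_url`, and `route_links` are hypothetical and are not taken from the paper.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical site identifiers (assumption); the abstract only mentions
# three sites spread across Germany.
SITES = ["site-a", "site-b", "site-c"]


def site_for_url(url, sites=SITES):
    """Return the site assumed to own the host of `url` (hash partitioning)."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer key.
    index = int.from_bytes(digest[:8], "big") % len(sites)
    return sites[index]


def route_links(links, local_site, sites=SITES):
    """Split discovered links into locally owned ones and per-site remote batches."""
    local, remote = [], {}
    for link in links:
        owner = site_for_url(link, sites)
        if owner == local_site:
            local.append(link)
        else:
            # Batching by destination lets links be shipped in bulk
            # instead of one message per URL.
            remote.setdefault(owner, []).append(link)
    return local, remote
```

Under this assumption, all pages of one host map to the same site, so links that stay within a host never cross sites; only links to hosts owned elsewhere are batched for transfer.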
Pages: 389-396
Number of pages: 8