Improving the Efficiency of Multi-site Web Search Engines

Cited by: 14
|
Authors
Frances, Guillem [1]
Bai, Xiao [2]
Cambazoglu, B. Barla [2]
Baeza-Yates, Ricardo [2]
Affiliations
[1] Univ Pompeu Fabra, Barcelona, Spain
[2] Yahoo Labs, Barcelona, Spain
Keywords
Distributed web search; query processing; document replication; query forwarding; result caching; efficiency; performance
DOI
10.1145/2556195.2556249
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A multi-site web search engine is composed of a number of search sites geographically distributed around the world. Each search site is typically responsible for crawling and indexing the web pages that are in its geographical neighborhood. A query is selectively processed on a subset of search sites that are predicted to return the best-matching results. The scalability and efficiency of multi-site web search engines have attracted a lot of research attention in recent years. In particular, research has focused on replicating important web pages across sites, forwarding queries to relevant sites, and caching results of previous queries. Yet these problems have been studied only in isolation, and no prior work has properly investigated the interplay between them. In this paper, we take up this challenge and conduct what we believe is the first comprehensive analysis of a full stack of techniques for efficient multi-site web search. Specifically, we propose a document replication technique that improves the query locality of the state-of-the-art approaches with various replication budget distribution strategies. We devise a machine learning approach to decide the query forwarding patterns, achieving a significantly lower false positive ratio than a state-of-the-art thresholding approach with little negative impact on search result quality. We propose three result caching strategies that reduce the number of forwarded queries and analyze the trade-off they introduce in terms of storage and network overheads. Finally, we show that the combination of the best-in-class techniques yields very promising search efficiency, rendering multi-site, geographically distributed web search engines an attractive alternative to centralized web search engines.
Pages: 3-12
Number of pages: 10
Related papers
50 in total
  • [1] Improving the engineering of Web search engines
    Ramadhan, HA
    Shihab, K
    [J]. IC'2000: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INTERNET COMPUTING, 2000: 29-33
  • [2] A multi-agent system for improving result ranking service of web search engines
    Kanawati, R
    [J]. CONCURRENT ENGINEERING: ENHANCED INTEROPERABLE SYSTEMS, 2003: 91-98
  • [3] Application of best practice towards improving Web site visibility to search engines: a pilot study
    Weideman, M.
    Chambers, R.
    [J]. SOUTH AFRICAN JOURNAL OF INFORMATION MANAGEMENT, 2005, 7 (04)
  • [4] Web Site Optimization for Search Engines An Empirical Study
    Silva, Nuno
    Aguiar, Antonio
    [J]. PROCEEDINGS OF THE 2014 9TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI 2014), 2014
  • [5] Search analytics: A guide to analyzing and optimizing web site search engines
    Wiley, Deborah Lynne
    [J]. ONLINE, 2007, 31 (01): 62-62
  • [6] Scalability and Efficiency Challenges in Commercial Web Search Engines
    Barla Cambazoglu, B.
    Baeza-Yates, Ricardo
    [J]. SIGIR'13: THE PROCEEDINGS OF THE 36TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH & DEVELOPMENT IN INFORMATION RETRIEVAL, 2013: 1124-1124
  • [7] Multi-tier architecture for web search engines
    Risvik, KM
    Aasheim, Y
    Lidal, M
    [J]. FIRST LATIN AMERICAN WEB CONGRESS, PROCEEDINGS, 2003: 132-143
  • [8] Planets in CM Draconis: A multi-site photometric search
    Martin, EL
    Deeg, H
    Chevreton, M
    Schneider, J
    Doyle, L
    Jenkins, J
    Palaiologou, E
    Lee, W
    [J]. INFRARED SPACE INTERFEROMETRY: ASTROPHYSICS & THE STUDY OF EARTH-LIKE PLANETS, 1997, 215: 59-61
  • [9] Address RF Multi-site Test Efficiency Challenge
    Ge, Liang
    [J]. CHINA SEMICONDUCTOR TECHNOLOGY INTERNATIONAL CONFERENCE 2010 (CSTIC 2010), 2010, 27 (01): 191-196
  • [10] Web search engines
    Schwartz, C
    [J]. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 1998, 49 (11): 973-982