A method for indexing web pages using web bots

Cited by: 0
Authors
Szymanski, BK [1]
Chung, MS [1]
Affiliations
[1] Rensselaer Polytech Inst, Dept Comp Sci, Troy, NY 12180 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Exploring the content of web pages for automatic indexing is of fundamental importance for efficient e-commerce and other applications of the Web. It enables users, including customers and businesses, to locate the best sources for their use. Today's search engines use one of two approaches to indexing web pages. They either (i) analyze the frequency of the words (after filtering out common or meaningless words) appearing in all or part (typically a title, an abstract, or the first 300 words) of the text of the target web page, or (ii) use sophisticated algorithms to take into account associations of words in the indexed web page. In both cases, only words appearing in the web page in question are used in the analysis. Often, to increase the relevance of the selected terms to potential searches, the indexing is refined by human processing. To identify so-called "authority" or "expert" pages, some search engines use the structure of the links between pages to identify pages that are often referenced by other pages. By analyzing the density, direction, and clustering of links, this method is capable of identifying the pages that are likely to contain valuable information. It is analogous to a well-known citation analysis method developed in library science and used by such publications as the Science Citation Index. A slightly different approach is used in the Google search engine implementation, which assigns to each page a score that depends on the frequency with which the page is visited by web surfers. The basic difference between the existing methods and the one discussed here is that the existing methods rely on the structure of web page linkages that lead from or to the indexed page. In contrast, our method uses the content of the pages linked to or from the indexed page for indexing. Thus our method uses the structure of words used by the linked pages, whereas the current methods use the structure of the connections between linked pages.
In this paper we propose and demonstrate the use of a new method based on bots that analyze the content of the pages linked to or from the page of interest. We analyze the similarity of word usage at different link distances from the page of interest and demonstrate that the structure of words used by the linked pages enables more efficient indexing and search.
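The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: pages are grouped by link distance from the page of interest (following links in either direction), and the word usage of each group is compared with that of the page of interest. The stop-word list, the cosine similarity measure, the toy pages, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter, deque

# Hypothetical stop-word list; the paper's actual filter is not specified.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}


def term_counts(text):
    """Count words in a page after dropping common words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity of two word-count vectors (an assumed measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pages_by_distance(links, start, max_distance):
    """Breadth-first search that treats links to and from a page alike."""
    neighbours = {}
    for src, dsts in links.items():
        for dst in dsts:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distance[page] == max_distance:
            continue  # do not expand beyond the requested distance
        for nxt in sorted(neighbours.get(page, ())):
            if nxt not in distance:
                distance[nxt] = distance[page] + 1
                queue.append(nxt)
    layers = {}
    for page, d in distance.items():
        layers.setdefault(d, []).append(page)
    return layers


def similarity_by_distance(pages, links, start, max_distance=2):
    """Compare the start page's words with the pooled words at each distance."""
    target = term_counts(pages[start])
    result = {}
    for d, names in pages_by_distance(links, start, max_distance).items():
        if d == 0:
            continue  # skip the page of interest itself
        pooled = Counter()
        for name in names:
            pooled += term_counts(pages.get(name, ""))
        result[d] = cosine(target, pooled)
    return result


# Toy data standing in for pages a bot would fetch (entirely made up).
PAGES = {
    "shop": "used books and rare books for sale",
    "catalog": "rare books catalog with prices",
    "reviews": "reader reviews of rare books",
    "weather": "local weather forecast and rain radar",
}
LINKS = {"shop": ["catalog"], "reviews": ["shop"], "catalog": ["weather"]}

if __name__ == "__main__":
    print(similarity_by_distance(PAGES, LINKS, "shop"))
```

In this toy graph, the pages one link away from "shop" share vocabulary with it while the page two links away does not. That is the kind of distance-dependent signal the abstract describes, although real results would depend on the crawled data.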
Pages: C1-C6
Number of pages: 6