Using HMM to learn user browsing patterns for focused Web crawling

Cited by: 44
Authors
Liu, Hongyu
Janssen, Jeannette
Milios, Evangelos
Affiliations
[1] Dalhousie Univ, Fac Comp Sci, Halifax, NS B3H 1W5, Canada
[2] Dalhousie Univ, Dept Math & Stat, Halifax, NS B3H 1W5, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
focused crawling; Web searching; relevance modelling; user modelling; pattern learning; Hidden Markov models; World Wide Web; Web Graph;
DOI
10.1016/j.datak.2006.01.012
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A focused crawler is designed to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. To estimate the relevance of a newly seen URL, it must use information gleaned from previously crawled page sequences. In this paper, we present a new approach for predicting which links lead to relevant pages, based on a Hidden Markov Model (HMM). The system consists of three stages: user data collection, user modelling via sequential pattern learning, and focused crawling. In particular, we first collect the Web pages visited during a user browsing session. These pages are clustered, and the link structure among pages from different clusters is then used to learn page sequences that are likely to lead to target pages. Learning is performed with an HMM. During crawling, the priority of links to follow is based on a learned estimate of how likely the page is to lead to a target page. We compare the approach with Context-Graph crawling and Best-First crawling; in our experiments it performs better than both. © 2006 Elsevier B.V. All rights reserved.
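The crawling stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, for illustration only, that hidden states encode how many links away a page is from a target page (state 0 being a target) and that observations are the cluster IDs of visited pages. The matrices `A`, `B`, `pi` and the names `forward_belief`, `link_priority`, and `Frontier` are hypothetical toy values and helpers, not taken from the paper.

```python
import heapq
import numpy as np

# Hypothetical toy HMM (values invented for illustration).
# Hidden state i = "i links away from a target page"; state 0 is a target.
# Observation k = cluster id assigned to a visited page.
A = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.3, 0.2],
              [0.1, 0.5, 0.4]])   # A[i, j] = P(next state j | current state i)
B = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])   # B[i, k] = P(cluster k | state i)
pi = np.array([0.2, 0.3, 0.5])    # initial state distribution


def forward_belief(clusters):
    """Normalized forward pass: P(current state | clusters seen so far)."""
    alpha = pi * B[:, clusters[0]]
    alpha = alpha / alpha.sum()
    for c in clusters[1:]:
        alpha = (alpha @ A) * B[:, c]
        alpha = alpha / alpha.sum()
    return alpha


def link_priority(clusters):
    """Estimated probability that the next page reached is a target (state 0)."""
    return float((forward_belief(clusters) @ A)[0])


class Frontier:
    """Crawl frontier that pops the URL with the highest estimated priority."""

    def __init__(self):
        self._heap = []

    def push(self, url, clusters):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-link_priority(clusters), url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority


frontier = Frontier()
# Each URL is queued with the cluster sequence of the path that led to it.
frontier.push("http://example.org/a", [2, 2])
frontier.push("http://example.org/b", [2, 1])
best_url, best_priority = frontier.pop()
```

With these toy parameters, a path whose last page falls in cluster 1 is ranked ahead of a path that stays in cluster 2, because cluster 1 is more typical of pages one link away from a target. In a real system the parameters would be estimated from the collected browsing sessions rather than set by hand.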
Pages: 270-291
Number of pages: 22