Optimal algorithms for finding user access sessions from very large web logs

Cited by: 22
Authors
Chen, ZX [1]
Fu, AWC
Tong, FCH
Affiliations
[1] Univ Texas Pan Amer, Dept Comp Sci, Edinburg, TX USA
[2] Chinese Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Peoples R China
Keywords
web log mining; data preparation; user access sessions; data structures; time complexity
DOI
10.1023/A:1024606901978
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Although efficient identification of user access sessions from very large web logs is an unavoidable data preparation task for the success of higher-level web log mining, little attention has been paid to the algorithmic study of this problem. In this paper we consider two types of user access sessions: interval sessions and gap sessions. We design two efficient algorithms, one for each session type, with the help of some proposed data structures. We present a theoretical analysis of the algorithms and prove that both have optimal time complexity as well as certain error-tolerant properties. We conduct an empirical performance analysis of the algorithms on web logs ranging from 100 megabytes to 500 megabytes. The empirical analysis shows that the algorithms take only several seconds more than the baseline time, i.e., the time needed to read the web log once sequentially from disk to RAM, test whether each user access record is valid, and write each valid user access record back to disk. It also shows that our algorithms are substantially faster than sorting-based session-finding algorithms. Finally, optimal algorithms for finding user access sessions from distributed web logs are also presented.
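The abstract does not define the two session types, so the following is only an illustrative sketch, not the authors' algorithms or data structures. It assumes the common reading that a gap session ends when the time between two consecutive accesses by the same user exceeds a threshold, and that an interval session ends when an access falls more than a fixed duration after the session's first access. The names `find_sessions`, `max_gap`, and `max_length`, and the `(user, timestamp, url)` record layout, are hypothetical. The sketch makes one pass over a time-ordered log and keeps one open session per user in a hash map, so it avoids sorting the log by user.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record layout: (user_id, timestamp_in_seconds, url).
Record = Tuple[str, float, str]
Session = List[Record]


def find_sessions(
    records: Iterable[Record],
    must_split: Callable[[Session, float], bool],
) -> List[Session]:
    """Single pass over a time-ordered log, one open session per user."""
    open_sessions: Dict[str, Session] = {}
    finished: List[Session] = []
    for user, ts, url in records:
        current = open_sessions.get(user)
        # Close the user's open session if the new access does not fit in it.
        if current is not None and must_split(current, ts):
            finished.append(current)
            current = None
        if current is None:
            current = []
            open_sessions[user] = current
        current.append((user, ts, url))
    # Flush the sessions still open at the end of the log.
    finished.extend(open_sessions.values())
    return finished


def gap_sessions(records: Iterable[Record], max_gap: float) -> List[Session]:
    # Assumed definition: split when the time since the previous access > max_gap.
    return find_sessions(records, lambda s, ts: ts - s[-1][1] > max_gap)


def interval_sessions(records: Iterable[Record], max_length: float) -> List[Session]:
    # Assumed definition: split when the time since the first access > max_length.
    return find_sessions(records, lambda s, ts: ts - s[0][1] > max_length)


if __name__ == "__main__":
    log = [
        ("u1", 0, "/a"),
        ("u2", 5, "/x"),
        ("u1", 10, "/b"),
        ("u2", 20, "/y"),
        ("u1", 100, "/c"),
    ]
    for session in gap_sessions(log, max_gap=30):
        print(session)
```

Under these assumptions each record is handled in expected constant time, so the pass is linear in the number of records. The sketch keeps whole sessions in memory; a version for very large logs would write each finished session to disk as soon as it closes.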
Pages: 259-279
Number of pages: 21
Related papers
50 in total
  • [1] Optimal Algorithms for Finding User Access Sessions from Very Large Web Logs
    Zhixiang Chen
    Ada Wai-Chee Fu
    Frank Chi-Hung Tong
    [J]. World Wide Web, 2003, 6 : 259 - 279
  • [2] Identifying user sessions from web server logs with integer programming
    Roman, Pablo E.
    Dell, Robert F.
    Velasquez, Juan D.
    Loyola, Pablo S.
    [J]. INTELLIGENT DATA ANALYSIS, 2014, 18 (01) : 43 - 61
  • [3] Optimal Algorithms for Generation of User Session Sequences Using Server Side Web User Logs
    Arumugam, G.
    Suguna, S.
    [J]. 2009 INTERNATIONAL CONFERENCE ON NETWORK AND SERVICE SECURITY, 2009, : 151 - +
  • [4] Finding All Maximal Paths In Web User Sessions
    Bayir, Murat Ali
    Toroslu, Ismail Hakki
    [J]. PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16 COMPANION), 2016, : 15 - 16
  • [5] An evolutionary approach for clustering user access patterns from web logs
    Wu, Rui
    [J]. AI 2006: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4304 : 1184 - 1188
  • [6] Linear and sublinear time algorithms for mining frequent traversal path patterns from very large web logs
    Chen, ZX
    Fowler, RH
    Fu, AWC
    Wang, CY
    [J]. SEVENTH INTERNATIONAL DATABASE ENGINEERING AND APPLICATIONS SYMPOSIUM, PROCEEDINGS, 2003, : 117 - 122
  • [7] Mining and tracking evolving web user trends from large web server logs
    Hawwash B.
    Nasraoui O.
    [J]. Statistical Analysis and Data Mining, 2010, 3 (02) : 106 - 125
  • [8] Discovery of web frequent patterns and user characteristics from web access logs: A framework for dynamic web personalization
    Dua, S
    Cho, EC
    Iyengar, SS
    [J]. 3RD IEEE SYMPOSIUM ON APPLICATION SPECIFIC SYSTEMS AND SOFTWARE ENGINEERING TECHNOLOGY, PROCEEDINGS, 2000, : 3 - 8
  • [9] Efficient mining of temporal traversal patterns from very large Web logs
    Chen, ZX
    [J]. DMIN '05: PROCEEDINGS OF THE 2005 INTERNATIONAL CONFERENCE ON DATA MINING, 2005, : 10 - 16
  • [10] GuidedTracker: Track the Victims with Access Logs to Finding Malicious Web Pages
    Sha, Hongzhou
    Liu, Qingyun
    Zhou, Zhou
    Zheng, Chao
    [J]. 2014 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM 2014), 2014, : 564 - 569