Data Lake Architecture for Storing and Transforming Web Server Access Log Files

Cited by: 6
Authors
Zagan, Elisabeta [1]
Danubianu, Mirela [1, 2]
Affiliations
[1] Stefan Cel Mare Univ Suceava, Fac Elect Engn & Comp Sci, Suceava 720229, Romania
[2] Stefan Cel Mare Univ Suceava, Integrated Ctr Res Dev & Innovat Adv Mat, Nanotechnol & Distributed Syst Fabricat & Control, Suceava 720229, Romania
Keywords
Computer architecture; Web servers; Big Data applications; Costs; Data models; Companies; Data mining; Cloud data lake; ADLS Gen2; data lake architecture; web server access log data; Azure function Blob trigger
DOI
10.1109/ACCESS.2023.3270368
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Web server access log files are text files that record server activity, client requests, server responses, and related data. Large-scale analysis of these data can support improvements in many areas of interest. The main problem lies in storing these files in their raw form over long periods, so that analysis processes can be run at any time and the extracted information can serve as a foundation for high-quality decisions. Our research focuses on offering an economical, secure, and high-performance solution for storing large amounts of raw log files. The proposed system implements a Data Lake (DL) architecture in the cloud, using Azure Data Lake Storage Gen2 (ADLS Gen2) for extract-load-transform (ELT) pipelines. This architecture allows large volumes of data to be stored in their raw form and later subjected to transformation and advanced analysis without requiring a schema to be defined at write time. The main contribution of this paper is an affordable and more accessible solution for web server access log data ingestion, storage, and transformation based on Data Lake technology. As a derivative contribution, we propose using an Azure Blob Trigger Function to implement the algorithm that transforms log files into Parquet files, leading to a 90% reduction in storage space compared to their original size and therefore much lower storage costs than if they had been stored as log files. A hierarchical data storage model is also proposed for shared access to data across the layers of the DL architecture, on top of which Data Lifecycle Management (DLM) rules are proposed for storage cost efficiency. We propose ingesting log files into a Data Lake deployed in the cloud because of its ease of deployment and low storage costs. The aim is to retain these data in the long term for use in future advanced analytics processes, where cross-referencing with other organizational or external data could bring important benefits. While the proposed solution is explicitly based on ADLS Gen2, it represents an important benchmark for approaching a cloud DL solution offered by any other vendor.
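The log-to-Parquet transformation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the logs follow the Apache Common/Combined Log Format (the abstract does not name a format), and the names `parse_line`, `to_columns`, and `curated_path`, along with the `layer/yyyy/mm/dd` folder layout, are hypothetical. The sketch covers only the parsing step and uses the standard library alone; in a real pipeline the resulting column dictionary would be handed to a Parquet writer (for example `pyarrow.Table.from_pydict` followed by `pyarrow.parquet.write_table`) inside the blob-triggered function, which is omitted here.

```python
import re
from datetime import datetime

# Assumed input: Apache Common/Combined Log Format (not stated in the source).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Parse one access log line into a typed record, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def to_columns(lines):
    """Turn raw lines into a column-oriented dict, the shape a Parquet writer expects."""
    columns = {}
    for record in filter(None, map(parse_line, lines)):
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def curated_path(layer, timestamp, name):
    """Hypothetical hierarchical layout: <layer>/<yyyy>/<mm>/<dd>/<name>.parquet."""
    return f"{layer}/{timestamp:%Y/%m/%d}/{name}.parquet"


sample = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326',
    'not a log line',  # unparseable lines are skipped
]
columns = to_columns(sample)
print(columns["status"], columns["size"])
print(curated_path("curated", columns["time"][0], "access"))
```

Storing typed, column-oriented values rather than repeated text is what allows a columnar format such as Parquet to compress the data well; the actual saving depends on the data and the writer settings.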
Pages: 40916-40929
Page count: 14