Data Lake Architecture for Storing and Transforming Web Server Access Log Files

Cited by: 6
Authors
Zagan, Elisabeta [1]
Danubianu, Mirela [1, 2]
Affiliations
[1] Stefan Cel Mare Univ Suceava, Fac Elect Engn & Comp Sci, Suceava 720229, Romania
[2] Stefan Cel Mare Univ Suceava, Integrated Ctr Res Dev & Innovat Adv Mat, Nanotechnol & Distributed Syst Fabricat & Control, Suceava 720229, Romania
Keywords
Computer architecture; Web servers; Big Data applications; Costs; Data models; Companies; Data mining; Cloud data lake; ADLS Gen2; data lake architecture; web server access log data; Azure function Blob trigger;
DOI
10.1109/ACCESS.2023.3270368
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Web server access log files are text files that record server activity, client requests addressed to the server, server responses, and related data. Large-scale analysis of these data can drive improvements in many areas of interest. The main problem lies in storing these files in their raw form over long periods, so that analysis processes can be run at any time and the extracted information can serve as a foundation for high-quality decisions. Our research focuses on offering an economical, secure, and high-performance solution for storing large amounts of raw log files. The proposed system implements a Data Lake (DL) architecture in the cloud, using Azure Data Lake Storage Gen2 (ADLS Gen2) for extract-load-transform (ELT) pipelines. This architecture allows large volumes of data to be stored in their raw form; afterwards they can be subjected to transformation and advanced analysis without the need for a schema defined at write time. The main contribution of this paper is an affordable and more accessible solution for web server access log data ingestion, storage, and transformation based on Data Lake technology. As a derivative contribution, we propose using an Azure Blob Trigger Function to implement the algorithm that transforms log files into Parquet files, leading to a 90% reduction in storage space compared to their original size and therefore much lower storage costs than if the data had been kept as log files. A hierarchical data storage model is also proposed for shared access to data across the layers of the DL architecture, on top of which Data Lifecycle Management (DLM) rules are proposed for storage cost efficiency. We propose ingesting log files into a Data Lake deployed in the cloud because of its ease of deployment and low storage costs. The aim is to maintain these data in the long term for use in future advanced analytics processes that cross-reference them with other organizational or external data, which could bring important benefits. While the proposed solution is explicitly based on ADLS Gen2, it represents an important benchmark for approaching a cloud DL solution offered by any other vendor.
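The log-to-Parquet transformation mentioned in the abstract starts by turning each raw text line into a typed record. The following is a minimal sketch of that parsing step only, not the authors' algorithm. It assumes the logs follow the Common Log Format, which the abstract does not state, and the names `CLF_PATTERN`, `parse_line`, and `parse_log` are hypothetical. Writing the records to Parquet and wiring the function to a Blob trigger are omitted.

```python
import re
from datetime import datetime

# Assumption: Common Log Format lines; the abstract does not name a log format.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Typed columns are what lets a columnar format encode the data compactly.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def parse_log(text):
    """Parse a whole log file's text, skipping lines that do not match."""
    records = (parse_line(line) for line in text.splitlines())
    return [record for record in records if record is not None]
```

A columnar writer could then persist the list returned by `parse_log` as a Parquet file. Skipping unmatched lines is a design choice of this sketch; a production pipeline might instead route them to a separate location for inspection.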
Pages: 40916-40929
Page count: 14