Data Lake Architecture for Storing and Transforming Web Server Access Log Files

Cited by: 6
Authors
Zagan, Elisabeta [1]
Danubianu, Mirela [1, 2]
Affiliations
[1] Stefan Cel Mare Univ Suceava, Fac Elect Engn & Comp Sci, Suceava 720229, Romania
[2] Stefan Cel Mare Univ Suceava, Integrated Ctr Res Dev & Innovat Adv Mat, Nanotechnol & Distributed Syst Fabricat & Control, Suceava 720229, Romania
Keywords
Computer architecture; Web servers; Big Data applications; Costs; Data models; Companies; Data mining; Cloud data lake; ADLS Gen2; data lake architecture; web server access log data; Azure function Blob trigger
DOI
10.1109/ACCESS.2023.3270368
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Web server access log files are text files that record server activity, client requests, server responses, and related data. Large-scale analysis of these data can drive improvements in many areas of interest. The main problem is storing these files in raw form over long periods so that analysis can be run at any time and the extracted information can support high-quality decisions. Our research focuses on an economical, secure, and high-performance solution for storing large volumes of raw log files. The proposed system implements a Data Lake (DL) architecture in the cloud, using Azure Data Lake Storage Gen2 (ADLS Gen2) for extract-load-transform (ELT) pipelines. This architecture allows large volumes of data to be stored in raw form and later subjected to transformation and advanced analysis without requiring a schema to be defined at write time. The main contribution of this paper is an affordable and accessible solution for ingesting, storing, and transforming web server access log data using Data Lake technology. As a derived contribution, we propose using an Azure Blob Trigger Function to implement the algorithm that transforms log files into Parquet files, which reduces storage space by 90% compared to the original files and therefore yields much lower storage costs than keeping them as log files. A hierarchical data storage model is also proposed for shared access to data across the layers of the DL architecture, on top of which Data Lifecycle Management (DLM) rules are proposed for storage cost efficiency. We propose ingesting log files into a cloud-deployed Data Lake because of its ease of deployment and low storage costs.
The aim is to retain these data over the long term for use in future advanced analytics that cross-reference them with other organizational or external data, which could bring important benefits. Although the proposed solution is explicitly based on ADLS Gen2, it can serve as a reference point for approaching a cloud DL solution offered by any other vendor.
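As an illustration of the transformation step described in the abstract, the following is a minimal sketch of how raw access log lines might be parsed into typed records before being written out in a columnar format. It is not the paper's algorithm. The Combined Log Format pattern, the `parse_line` and `curated_path` helpers, and the `curated/<server>/<year>/<month>/<day>/` folder layout are all assumptions made for illustration; the abstract does not specify the log format or the folder hierarchy.

```python
import re
from datetime import datetime

# Assumed input: Apache/nginx Combined Log Format. The abstract does not
# state which log format the paper actually uses.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Parse one log line into a dict of typed fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # malformed lines are skipped rather than guessed at
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def curated_path(server, timestamp):
    """Build a hypothetical date-partitioned target path for the transformed file."""
    return f"curated/{server}/{timestamp:%Y/%m/%d}/access.parquet"


sample = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
)
record = parse_line(sample)
print(record["host"], record["status"], record["size"])
print(curated_path("web01", record["time"]))
```

In a real pipeline the parsed records would then be written to Parquet with a columnar library and the function would be invoked by a blob-created event; both steps are omitted here because they depend on the deployment.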
Pages: 40916-40929
Number of pages: 14