Data Lake Architecture for Storing and Transforming Web Server Access Log Files

Cited by: 6
Authors
Zagan, Elisabeta [1]
Danubianu, Mirela [1, 2]
Affiliations
[1] Stefan Cel Mare Univ Suceava, Fac Elect Engn & Comp Sci, Suceava 720229, Romania
[2] Stefan Cel Mare Univ Suceava, Integrated Ctr Res Dev & Innovat Adv Mat, Nanotechnol & Distributed Syst Fabricat & Control, Suceava 720229, Romania
Keywords
Computer architecture; Web servers; Big Data applications; Costs; Data models; Companies; Data mining; Cloud data lake; ADLS Gen2; data lake architecture; web server access log data; Azure function Blob trigger;
DOI
10.1109/ACCESS.2023.3270368
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Web server access log files are text files containing important data about server activity, client requests addressed to a server, server responses, and so on. Large-scale analysis of these data can contribute to improvements in many areas of interest. The main problem lies in storing these files in their raw form over long periods, so that analysis processes can be run at any time and information can be extracted as a foundation for high-quality decisions. Our research focuses on offering an economical, secure, and high-performance solution for storing large amounts of raw log files. The proposed system implements a Data Lake (DL) architecture in the cloud, using Azure Data Lake Storage Gen2 (ADLS Gen2) for extract-load-transform (ELT) pipelines. This architecture allows large volumes of data to be stored in their raw form. Afterwards, they can be subjected to transformation and advanced analysis processes without the need for a structured schema at write time. The main contribution of this paper is an affordable and more accessible solution for web server access log data ingestion, storage, and transformation built on the newest technology, the Data Lake. As a derivative contribution, we propose using an Azure Blob Trigger Function to implement the algorithm that transforms log files into Parquet files, leading to a 90% reduction in storage space compared to their original size. That means much lower storage costs than if they had been stored as log files. A hierarchical data storage model is also proposed for shared access to data across the different layers of the DL architecture, on top of which Data Lifecycle Management (DLM) rules are proposed for storage cost efficiency. We propose ingesting log files into a Data Lake deployed in the cloud because of its ease of deployment and low storage costs. The aim is to retain these data over the long term for use in future advanced analytics processes that cross-reference them with other organizational or external data, which could bring important benefits. While the proposed solution is explicitly based on ADLS Gen2, it represents an important benchmark for approaching a cloud DL solution offered by any other vendor.
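As an illustration of the transformation step the abstract describes, the following is a minimal sketch of how raw access log lines might be parsed into a columnar layout before being written as Parquet. It is not the authors' algorithm. It assumes the logs follow the common Apache/NGINX Combined Log Format, which the abstract does not state, and the names `parse_line`, `to_columns`, `curated_path`, the `curated` layer, and the `weblogs` folder are all hypothetical.

```python
import re
from datetime import datetime

# Assumption: Combined Log Format; the paper's actual log layout is not given here.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Parse one log line into a typed record, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def to_columns(lines):
    """Turn log lines into a dict of column lists, skipping unparseable lines."""
    columns = {}
    for record in filter(None, map(parse_line, lines)):
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def curated_path(record, layer="curated"):
    """Build a hypothetical date-partitioned path inside one layer of the lake."""
    t = record["time"]
    return f"{layer}/weblogs/year={t:%Y}/month={t:%m}/day={t:%d}/access.parquet"


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
```

A column dictionary of this shape could then be handed to a Parquet writer such as the `pyarrow` library. Columnar storage with typed fields is one plausible source of the space savings the abstract reports, although the 90% figure is the authors' measurement and is not reproduced by this sketch.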
Pages: 40916-40929
Number of pages: 14