Distributed real-time ETL architecture for unstructured big data

Cited by: 5
Authors
Mehmood, Erum [1]
Anees, Tayyaba [1]
Affiliations
[1] Univ Management & Technol, Sch Syst & Technol, Lahore 54770, Punjab, Pakistan
Keywords
Real-time stream; Distributed big data; Spark Structured Streaming; Apache Kafka; MongoDB; Semi-structured; Unstructured data; ETL; Data pipeline; STREAM; FRAMEWORK; MANAGEMENT
DOI
10.1007/s10115-022-01757-7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Real-time extract-transform-load (ETL) is integral to meeting the growing demand for faster business decisions across a large number of modern applications. Because of the volume and velocity of the data, extracting unstructured streams from multiple sources and transforming them against disk data in a distributed environment are the building blocks of real-time ETL. Designing an architecture for these building blocks therefore remains a major challenge. In this paper, we focus primarily on expediting stream-disk joins during the transformation phase of ETL; this join is considered the most expensive operator in stream processing because of frequent disk access. We propose a real-time ETL architecture that ingests unstructured data streams from multiple sources, without regard to the structure of those sources, and transforms them after joining with distributed disk data. We also present a novel data-pipeline stream-disk join that uses partition-based input and a best-effort in-memory database technique to reduce frequent disk access. The proposed architecture addresses the challenges of stream data loss, ignored unmatched streams, disk overhead, and real-time processing in a distributed environment. Experimental results obtained using a stream generator and real-world datasets on local and distributed machines show that the proposed architecture yields significantly improved throughput, especially for large numbers of stream tuples with large datasets.
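The idea of a stream-disk join with partition-based input and a best-effort in-memory store can be illustrated with a minimal sketch. This is not the authors' implementation: the names `BestEffortCache`, `partition_batch`, and `stream_disk_join` are hypothetical, a plain LRU cache stands in for the in-memory database, and a Python dict stands in for the distributed disk data (MongoDB in the paper's keyword list). It only shows how grouping stream tuples by key and caching disk rows can cut repeated disk lookups, and how unmatched tuples can be kept rather than dropped.

```python
from collections import OrderedDict, defaultdict


class BestEffortCache:
    """LRU cache in front of a slow disk lookup (hypothetical stand-in
    for a best-effort in-memory database)."""

    def __init__(self, disk_lookup, capacity):
        self.disk_lookup = disk_lookup  # callable: key -> row dict or None
        self.capacity = capacity
        self.entries = OrderedDict()
        self.disk_reads = 0             # counts lookups that reached the disk

    def get(self, key):
        if key in self.entries:
            # Cache hit: refresh recency and skip the disk entirely.
            self.entries.move_to_end(key)
            return self.entries[key]
        # Cache miss: fall back to the disk and remember the answer,
        # including None, so repeated unmatched keys are also served from memory.
        self.disk_reads += 1
        row = self.disk_lookup(key)
        self.entries[key] = row
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return row


def partition_batch(batch, num_partitions, key_field):
    """Group stream tuples by hash of the join key (partition-based input)."""
    partitions = defaultdict(list)
    for tup in batch:
        partitions[hash(tup[key_field]) % num_partitions].append(tup)
    return partitions


def stream_disk_join(batch, cache, num_partitions=4, key_field="id"):
    """Join one micro-batch against disk data through the cache.

    Returns (joined, unmatched); unmatched tuples are retained, not ignored.
    """
    joined, unmatched = [], []
    partitions = partition_batch(batch, num_partitions, key_field)
    for _, tuples in sorted(partitions.items()):
        for tup in tuples:
            row = cache.get(tup[key_field])
            if row is None:
                unmatched.append(tup)
            else:
                joined.append({**tup, **row})
    return joined, unmatched


# Assumed toy data: a dict plays the role of the disk-resident dataset.
disk = {1: {"name": "a"}, 2: {"name": "b"}}
cache = BestEffortCache(disk.get, capacity=2)
batch = [
    {"id": 1, "v": 10},
    {"id": 1, "v": 11},
    {"id": 3, "v": 12},
    {"id": 2, "v": 13},
]
joined, unmatched = stream_disk_join(batch, cache, num_partitions=2)
```

In this toy run the second tuple with `id` 1 is answered from memory, so four stream tuples cause only three disk lookups; with skewed real-world keys the saving would presumably be larger, though the actual gain depends on cache size and key distribution.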
Pages: 3419-3445
Number of pages: 27
Related papers
50 in total
  • [1] Distributed real-time ETL architecture for unstructured big data
    Erum Mehmood
    Tayyaba Anees
    [J]. Knowledge and Information Systems, 2022, 64 : 3419 - 3445
  • [2] Real-Time Data ETL Framework for Big Real-Time Data Analysis
    Li, Xiaofang
    Mao, Yingchi
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION, 2015, : 1289 - 1294
  • [3] AScale: Big/Small Data ETL and Real-Time Data Freshness
    Martins, Pedro
    Abbasi, Maryam
    Furtado, Pedro
    [J]. BEYOND DATABASES, ARCHITECTURES AND STRUCTURES, BDAS 2016, 2016, 613 : 315 - 327
  • [4] RUBA: Real-time Unstructured Big Data Analysis Framework
    Kim, Jaein
    Kim, Nacwoo
    Lee, Byungtak
    Park, Joonho
    Seo, Kwangik
    Park, Hunyoung
    [J]. 2013 INTERNATIONAL CONFERENCE ON ICT CONVERGENCE (ICTC 2013): FUTURE CREATIVE CONVERGENCE TECHNOLOGIES FOR NEW ICT ECOSYSTEMS, 2013, : 520 - 524
  • [5] Real-Time Big Data Analysis Architecture and Application
    Sharma, Nandani
    Agarwal, Manisha
    [J]. DATA SCIENCE AND BIG DATA ANALYTICS, 2019, 16 : 313 - 320
  • [6] HBelt: Integrating an Incremental ETL Pipeline with a Big Data Store for Real-Time Analytics
    Qu, Weiping
    Shankar, Sahana
    Ganza, Sandy
    Dessloch, Stefan
    [J]. ADVANCES IN DATABASES AND INFORMATION SYSTEMS, ADBIS 2015, 2015, 9282 : 123 - 137
  • [7] An ETL Strategy for Real-Time Data Warehouse
    Zhou, Haihe
    Yang, Dingyu
    Xu, Yang
    [J]. PRACTICAL APPLICATIONS OF INTELLIGENT SYSTEMS, 2011, 124 : 329 - +
  • [8] The SOLID architecture for real-time management of big semantic data
    Martinez-Prieto, Miguel A.
    Cuesta, Carlos E.
    Arias, Mario
    Fernandez, Javier D.
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2015, 47 : 62 - 79
  • [9] Big Data Analytics Architecture for Real-Time Traffic Control
    Amini, Sasan
    Gerostathopoulos, Ilias
    Prehofer, Christian
    [J]. 2017 5TH IEEE INTERNATIONAL CONFERENCE ON MODELS AND TECHNOLOGIES FOR INTELLIGENT TRANSPORTATION SYSTEMS (MT-ITS), 2017, : 710 - 715
  • [10] A Big Data Architecture for Near Real-time Traffic Analytics
    Gong, Yikai
    Rimba, Paul
    Sinnott, Richard O.
    [J]. COMPANION PROCEEDINGS OF THE 10TH INTERNATIONAL CONFERENCE ON UTILITY AND CLOUD COMPUTING (UCC'17 COMPANION), 2017, : 157 - 162