Real-time Data Streaming using Apache Spark on Fully Configured Hadoop Cluster

Cited by: 0
Authors
Prasad, Kashi Sai [1]
Pasupathy, S. [2]
Affiliations
[1] MLR Inst Technol, Dept Comp Sci & Engn, Hyderabad 500043, India
[2] Annamalai Univ, Dept Comp Sci & Engn, Annamalainagar 608002, Tamil Nadu, India
Keywords
Apache Spark; BigData; Flume; Hadoop; MapReduce; Twitter data ingestion
DOI
10.26782/jmcms.2018.12.00013
Chinese Library Classification number
O3 [Mechanics]
Discipline classification code
08; 0801
Abstract
Data plays a major role in today's Internet world. Analyzing historical data has become easy thanks to advances in analytical tools, but gathering data from social networking websites remains a great challenge for today's data scientists. Many advances have been made, and much research has been conducted, on gathering streaming data (data generated every second). Hadoop provides a component called Apache Flume to ingest data into HDFS for processing with MapReduce. Flume has its own benefits, which have made many analyses of social networking data easy, but it requires in-depth knowledge of configuration files and administration. Our work proposes a framework for real-time streaming of Twitter data. Apache Spark, which improves on Hadoop in processing speed, provides much more insight than Apache Flume. Spark is an in-memory distributed computing engine designed to increase processing speed over MapReduce, and it is considered one of the most advanced ecosystem components for batch and near-real-time processing. In this paper we explain in detail data ingestion using Apache Spark and Scala IDE. In our work, data is ingested directly from the Twitter website using the tokens and access keys provided, as explained in chapters 3 and 4. Our GUI also lets a user post tweets directly without visiting the Twitter website. We also provide an option to categorize the tweets of specific persons using '#' tags. The data thus obtained can be used for statistical analysis and report generation.
Pages: 164-176
Number of pages: 13
Related papers
  • [1] Real-Time Heart Arrhythmia Detection Using Apache Spark Structured Streaming
    Ilbeigipour, Sadegh
    Albadvi, Amir
    Akhondzadeh Noughabi, Elham
    [J]. JOURNAL OF HEALTHCARE ENGINEERING, 2021, 2021
  • [2] Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark
    Armbrust, Michael
    Das, Tathagata
    Torres, Joseph
    Yavuz, Burak
    Zhu, Shixiong
    Xin, Reynold
    Ghodsi, Ali
    Stoica, Ion
    Zaharia, Matei
    [J]. SIGMOD'18: PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2018, : 601 - 613
  • [3] Real-time Pattern Detection in IP Flow Data using Apache Spark
    Cermak, Milan
    Lastovicka, Martin
    Jirsik, Tomas
    [J]. 2019 IFIP/IEEE SYMPOSIUM ON INTEGRATED NETWORK AND SERVICE MANAGEMENT (IM), 2019, : 521 - 526
  • [4] KORDI: A Framework for Real-Time Performance and Cost Optimization of Apache Spark Streaming
    Kordelas, Athanasios
    Spyrou, Thanasis
    Voulgaris, Spyros
    Megalooikonomou, Vasileios
    Deligiannis, Nikos
    [J]. 2023 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE, ISPASS, 2023, : 337 - 339
  • [5] Real-time Processing of IoT Events with Historic data using Apache Kafka and Apache Spark with Dashing framework
    D'silva, Godson Michael
    Khan, Azharuddin
    Joshi, Gaurav
    Bari, Siddhesh
    [J]. 2017 2ND IEEE INTERNATIONAL CONFERENCE ON RECENT TRENDS IN ELECTRONICS, INFORMATION & COMMUNICATION TECHNOLOGY (RTEICT), 2017, : 1804 - 1809
  • [6] Real-time incremental recommendation for streaming data based on apache flink
    Tang, Zhuo
    Liu, Zeyu
    Li, Kenli
    Li, Keqin
    [J]. INTELLIGENT DATA ANALYSIS, 2019, 23 (06) : 1421 - 1437
  • [7] Real-Time Regex Matching With Apache Spark
    Deaton, Sean
    Brownfield, David
    Kosta, Leonard
    Zhu, Zhaozhong
    Matthews, Suzanne J.
    [J]. 2017 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2017,
  • [8] Real-Time Big Data Stream Processing Using GPU with Spark Over Hadoop Ecosystem
    Rathore, M. Mazhar
    Son, Hojae
    Ahmad, Awais
    Paul, Anand
    Jeon, Gwanggil
    [J]. INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2018, 46 (03) : 630 - 646
  • [9] Real-time Analysis of NetFlow Data for Generating Network Traffic Statistics using Apache Spark
    Cermak, Milan
    Jirsik, Tomas
    Lastovicka, Martin
    [J]. NOMS 2016 - 2016 IEEE/IFIP NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM, 2016, : 1019 - 1020