Real-time Data Streaming using Apache Spark on Fully Configured Hadoop Cluster

Authors
Prasad, Kashi Sai [1]
Pasupathy, S. [2]
Affiliations
[1] MLR Inst Technol, Dept Comp Sci & Engn, Hyderabad 500043, India
[2] Annamalai Univ, Dept Comp Sci & Engn, Annamalainagar 608002, Tamil Nadu, India
Keywords
Apache Spark; Big Data; Flume; Hadoop; MapReduce; Twitter data ingestion
DOI
10.26782/jmcms.2018.12.00013
Chinese Library Classification
O3 [Mechanics]
Discipline classification code
08; 0801
Abstract
Data plays a major role in today's Internet world. Advances in analytical tools have made it easy to analyze historical data, but gathering data from social networking websites remains a great challenge for data scientists. Considerable research has been conducted on gathering streaming data (data generated every second). The Hadoop ecosystem provides a component called Apache Flume to ingest data into HDFS for processing with MapReduce. Flume has its own benefits, which have made many analyses of social networking data easier, but it requires in-depth knowledge of configuration files and administration. Our work proposes a framework for real-time streaming of Twitter data. Apache Spark, which improves on Hadoop in processing speed, provides much more insight than Apache Flume. Spark is an in-memory distributed computing engine that increases processing speed over MapReduce, and it is considered one of the most advanced ecosystem components for batch and near-real-time processing. In this paper we explain in detail data ingestion using Apache Spark and the Scala IDE. In our work, data is ingested directly from the Twitter website using the tokens and access keys provided, as explained in Sections 3 and 4. Our GUI also lets a user post tweets directly, without visiting the Twitter website. We also provide an option to categorize the tweets of specific persons using '#' tags. The data thus obtained can be used for statistical analysis and report generation.
Pages: 164-176
Page count: 13