Real-time Data Streaming using Apache Spark on Fully Configured Hadoop Cluster

Cited by: 0
Authors
Prasad, Kashi Sai [1]
Pasupathy, S. [2]
Affiliations
[1] MLR Inst Technol, Dept Comp Sci & Engn, Hyderabad 500043, India
[2] Annamalai Univ, Dept Comp Sci & Engn, Annamalainagar 608002, Tamil Nadu, India
Keywords
Apache Spark; Big Data; Flume; Hadoop; MapReduce; Twitter data ingestion
DOI
10.26782/jmcms.2018.12.00013
Chinese Library Classification (CLC) number
O3 [Mechanics]
Discipline classification code
08; 0801
Abstract
Data plays a major role in today's Internet world. Analyzing historical data has become easy thanks to advances in analytical tools, but gathering data from social networking websites remains a great challenge for data scientists. Much research has been conducted on gathering streaming data (data generated every second). The Hadoop ecosystem provides a component called Apache Flume to ingest data into HDFS for processing with MapReduce. Flume has its own benefits, which have made many analyses of social networking data easier, but it requires in-depth knowledge of configuration files and administration. Our work proposes a framework for real-time streaming of Twitter data. Apache Spark, which improves on Hadoop in processing speed, provides much more insight than Apache Flume. Spark is an in-memory distributed computing engine designed to increase processing speed over MapReduce, and it is considered one of the most advanced ecosystem components for batch and near-real-time processing. In this paper we explain in detail data ingestion using Apache Spark and Scala IDE. In our work, data is ingested directly from the Twitter website using the provided tokens and access keys, as explained in Sections 3 and 4. Our GUI also lets a user post tweets directly, without visiting the Twitter website. We also provide an option to categorize the tweets of specific persons using '#' tags. The data thus obtained can be used for statistical analysis and report generation.
Pages: 164-176
Page count: 13