Cloud Based Web Scraping for Big Data Applications

Cited by: 19
Authors
Chaulagain, Ram Sharan [1]
Pandey, Santosh [1]
Basnet, Sadhu Ram [1]
Shakya, Subarna [1]
Affiliations
[1] Tribhuvan Univ, Inst Engn, Dept Elect & Comp Engn, Lalitpur, Nepal
Keywords
Cloud Computing; Web Scraping; cloud-based web scrapper; Selenium; XPath;
DOI
10.1109/SmartCloud.2017.28
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
With the spread of new technologies, the number of internet users and the volume of data (mostly unstructured) that they generate on the internet are growing rapidly. As scraping is one of the major means of extracting unstructured data from the internet, we analyze the scraping process when it is applied to bulk data extraction. We faced several challenges while scraping large amounts of data, such as encountering CAPTCHAs, storage issues for a large volume of data, the need for intensive computation capacity, and the reliability of data extraction. In this paper, we investigate a cloud-based web scraping architecture that can handle storage and computing resources with on-demand elasticity using Amazon Web Services (Elastic Compute Cloud and DynamoDB). Our solution tries to address both scraping and feasibility for big data applications in a single cloud-based architecture for data-based industries. We discuss Selenium as one of our tools for web scraping because the web drivers it supports simulate a real user working with a browser. We also analyze the scalability and performance of the proposed cloud-based scraper and describe its advantages over other cloud-based scrapers.
Pages: 138-143
Number of pages: 6
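
As a rough illustration of the pipeline the abstract outlines (load a page, pull fields out with XPath, write records to a key-value store, and spread URLs across workers), the following is a minimal sketch and not the authors' implementation. To stay runnable without external dependencies it swaps in stand-ins that are assumptions of this example: `fetch` returns canned markup instead of driving a browser through Selenium, a plain `dict` takes the place of a DynamoDB table, and a local thread pool takes the place of multiple EC2 instances. The URLs, field names, and function names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pages standing in for markup a real browser session would return.
PAGES = {
    "https://example.com/item/1":
        "<html><body><h1 class='title'>Alpha</h1>"
        "<span class='price'>10</span></body></html>",
    "https://example.com/item/2":
        "<html><body><h1 class='title'>Beta</h1>"
        "<span class='price'>20</span></body></html>",
}


def fetch(url):
    """Stand-in for a Selenium-driven page load; returns canned markup."""
    return PAGES[url]


def scrape(url):
    """Extract fields using XPath expressions (ElementTree's XPath subset)."""
    root = ET.fromstring(fetch(url))
    title = root.find(".//h1[@class='title']").text
    price = int(root.find(".//span[@class='price']").text)
    return {"url": url, "title": title, "price": price}


def run(urls, workers=2):
    """Fan URLs out to a worker pool and collect records keyed by URL."""
    store = {}  # stand-in for a key-value table such as DynamoDB
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for record in pool.map(scrape, urls):
            store[record["url"]] = record
    return store


if __name__ == "__main__":
    for item in run(sorted(PAGES)).values():
        print(item)
```

In a real deployment, `fetch` would wrap a browser session, the store would be a managed table, and the worker count would follow the number of provisioned instances; those substitutions are left out here because the record does not give their details.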