Cloud Based Web Scraping for Big Data Applications

Cited by: 19
Authors
Chaulagain, Ram Sharan [1]
Pandey, Santosh [1]
Basnet, Sadhu Ram [1]
Shakya, Subarna [1]
Affiliations
[1] Tribhuvan Univ, Inst Engn, Dept Elect & Comp Engn, Lalitpur, Nepal
Keywords
Cloud Computing; Web Scraping; cloud-based web scrapper; Selenium; XPath;
DOI
10.1109/SmartCloud.2017.28
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
With the penetration of new technologies, the number of internet users and the volume of data (mostly unstructured) they generate on the internet are growing rapidly. As scraping is one of the major means of extracting unstructured data from the internet, we have analyzed the scraping process when applied to bulk data extraction. We faced several challenges while scraping large amounts of data, such as encountering CAPTCHAs, storage issues for a large volume of data, the need for intensive computation capacity, and the reliability of data extraction. In this paper, we investigate a cloud-based web scraping architecture able to handle storage and computing resources with elasticity on demand using Amazon Web Services (Elastic Compute Cloud and DynamoDB). Our solution tries to address both scraping and feasibility for big data applications in a single cloud-based architecture for data-based industries. We discuss Selenium as one of our tools for web scraping because the web drivers it supports simulate a real user working with a browser. We also analyze the scalability and performance of the proposed cloud-based scraper and describe its advantages over other cloud-based scrapers.
Pages: 138-143
Page count: 6