Hadoop and Spark for Data Management, Processing and Analysis of Astronomical Big Data: Applicability and Performance

Authors
Harischandra, Lloyd [1]
Affiliation
[1] Australian Astronomical Observatory, PO Box 915, North Ryde, NSW 1670, Australia
Keywords
DOI
Not available
Chinese Library Classification
P1 [Astronomy];
Discipline classification code
0704;
Abstract
The AAT node of the All Sky Virtual Observatory (ASVO) is being built on top of Apache Hadoop and Apache Spark. The Hadoop Distributed File System (HDFS) serves as the data store and Apache Spark as the data processing engine. The cluster consists of 4 nodes, 3 of which provide data storage, while all 4 can be used for computation. In this paper, we compare the performance of Apache Spark on GAMA data hosted on HDFS against relational database management systems and other software used for the management, processing and analysis of astronomical Big Data. We examine the usability, flexibility and extensibility of the libraries and languages available within Spark, specifically for querying and processing large amounts of heterogeneous astronomical data. The data are primarily tabular, but we discuss how the functionality of the Hadoop and Spark libraries can be used to store, process, transform and query data in other formats such as HDF5 and FITS. We also discuss the limitations of existing relational database management systems in terms of scalability and usability. We then evaluate benchmark results for a range of data import and transformation scenarios, and the expected latency of queries across a range of complexities. Lastly, we show how astronomers can create custom data-processing tasks in their preferred language (Python, R, etc.) using Spark, with limited knowledge of the Hadoop technologies.
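As an illustration of the last point, a custom per-row task in Python might look like the minimal sketch below. It is not taken from the paper: the flux-to-AB-magnitude conversion, the column names `flux_r` and `mag_r`, the Parquet storage format and the HDFS path are all assumptions chosen for the example. The science logic is a plain Python function, and Spark is only used to apply it to a table.

```python
import math


def flux_to_ab_mag(flux_jy):
    """Convert a flux density in janskys to an AB magnitude.

    Returns None for missing or non-positive fluxes, which Spark
    stores as a null in the output column.
    """
    if flux_jy is None or flux_jy <= 0:
        return None
    # AB zero point: a source of 3631 Jy has magnitude 0.
    return -2.5 * math.log10(flux_jy / 3631.0)


def add_magnitudes(spark, hdfs_path):
    """Read a table from HDFS and append a magnitude column.

    `hdfs_path`, `flux_r` and `mag_r` are hypothetical names; the
    table is assumed to be stored as Parquet.
    """
    # Imported here so flux_to_ab_mag can be used and tested
    # on a machine without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    to_mag = F.udf(flux_to_ab_mag, DoubleType())
    table = spark.read.parquet(hdfs_path)
    return table.withColumn("mag_r", to_mag(F.col("flux_r")))


# Example call on a cluster (hypothetical path):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("custom-task").getOrCreate()
#   add_magnitudes(spark, "hdfs:///data/catalogue.parquet").show(5)
```

Keeping the conversion separate from the Spark wiring means the function can be checked locally before it is run against data on HDFS.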
Pages: 41-44
Page count: 4