Evaluating Genomic Big Data Operations on SciDB and Spark

Cited by: 2
Authors
Cattani, Simone [1]
Ceri, Stefano [1]
Kaitoua, Abdulrahman [1]
Pinoli, Pietro [1]
Affiliations
[1] Politecn Milan, Dip Elettron Informaz & Bioingn, Milan, Italy
Source
WEB ENGINEERING (ICWE 2017) | 2017 / Volume 10360
DOI
10.1007/978-3-319-60131-1_34
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We are developing a new, holistic data management system for genomics, which provides high-level abstractions for querying large genomic datasets. We designed our system so that it leverages data management engines for low-level data access. Such a design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). Trade-offs are not obvious; scientific databases are expected to outperform generic platforms when they use features which are embedded within their specialized design, but generic platforms are expected to outperform scientific databases on general-purpose operations. In this paper, we compare our SciDB and Spark implementations at work on genomic abstractions. We use four typical genomic operations as a benchmark, stemming from the concrete requirements of our project and encoded using SciDB and Spark; we discuss their common aspects and differences, specifically how genomic regions and operations can be expressed using SciDB arrays. We comparatively evaluate the performance and scalability of the two implementations over datasets consisting of billions of genomic regions.
Pages: 482-493
Page count: 12