Evaluating Genomic Big Data Operations on SciDB and Spark

Cited by: 2

Authors:

Cattani, Simone [1]

Ceri, Stefano [1]

Kaitoua, Abdulrahman [1]

Pinoli, Pietro [1]

Affiliations:

[1] Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria, Milan, Italy

Source:

WEB ENGINEERING (ICWE 2017) | 2017 / Vol. 10360

Keywords:

DOI:

10.1007/978-3-319-60131-1_34

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We are developing a new, holistic data management system for genomics that provides high-level abstractions for querying large genomic datasets. We designed the system to rely on existing data management engines for low-level data access. This design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). The trade-offs are not obvious: scientific databases are expected to outperform generic platforms when they use features embedded within their specialized design, whereas generic platforms are expected to outperform scientific databases on general-purpose operations. In this paper, we compare our SciDB and Spark implementations at work on genomic abstractions. As a benchmark we use four typical genomic operations, which stem from the concrete requirements of our project and are encoded in both SciDB and Spark; we discuss their common aspects and differences, with particular attention to how genomic regions and operations can be expressed using SciDB arrays. We comparatively evaluate the performance and scalability of the two implementations over datasets consisting of billions of genomic regions.


Pages: 482-493

Page count: 12

