Alignment-free Genomic Analysis via a Big Data Spark Platform

Cited by: 6

Authors:

Petrillo, Umberto Ferraro [1]

Palini, Francesco [1]

Cattaneo, Giuseppe [2]

Giancarlo, Raffaele [3]

Affiliations:

[1] Univ Roma La Sapienza, Dipartimento Sci Statist, I-00185 Rome, Italy

[2] Univ Salerno, Dipartimento Informat, I-84084 Fisciano, SA, Italy

[3] Univ Palermo, Dipartimento Matemat Informat, I-90133 Palermo, Italy

Source:

BIOINFORMATICS | 2021 / Vol. 37 / Issue 12

Keywords:

STATISTICS;

DOI:

10.1093/bioinformatics/btab014

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Because the applications are data-intensive, the computation of AF functions is a Big Data problem, and the recent literature indicates that developing fast and scalable algorithms for computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity.

Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. The development of FADE and its potential impact comprise novel aspects of interest, namely: (i) a considerable effort in distributed algorithms, the most tangible result being a much faster execution time for reference methods like MASH and FSWM; (ii) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (iii) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
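For readers unfamiliar with the term, the following is a minimal single-machine sketch of what an alignment-free function can look like, assuming a plain k-mer counting approach. It is not taken from FADE and does not reproduce any of the eighteen functions the abstract refers to; the function names and the default `k = 3` are hypothetical choices made only for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for k-mers absent from one of the sequences.
    return sqrt(sum((ca[m] - cb[m]) ** 2 for m in ca.keys() | cb.keys()))


def jaccard_kmer_distance(a: str, b: str, k: int = 3) -> float:
    """One minus the Jaccard index of the two k-mer sets."""
    sa, sb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = sa | sb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


if __name__ == "__main__":
    x, y = "ACGTACGTGA", "ACGTTCGTGA"
    print(euclidean_kmer_distance(x, y), jaccard_kmer_distance(x, y))
```

Neither function aligns the two sequences; both compare summary statistics of their k-mers, which is why such computations can be spread over many workers when the inputs are large.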


Pages: 1658-1665

Page count: 8

