Alignment-free Genomic Analysis via a Big Data Spark Platform

Cited by: 6
Authors
Petrillo, Umberto Ferraro [1]
Palini, Francesco [1]
Cattaneo, Giuseppe [2]
Giancarlo, Raffaele [3]
Affiliations
[1] Univ Roma La Sapienza, Dipartimento Sci Statist, I-00185 Rome, Italy
[2] Univ Salerno, Dipartimento Informat, I-84084 Fisciano, SA, Italy
[3] Univ Palermo, Dipartimento Matemat Informat, I-90133 Palermo, Italy
Keywords
STATISTICS;
DOI
10.1093/bioinformatics/btab014
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Because these applications are data-intensive, computing AF functions is a Big Data problem, and the recent literature indicates that developing fast and scalable algorithms for computing them is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions identified by a recent hallmark benchmarking study. The development of FADE and its potential impact comprise novel aspects of interest, namely: (i) a considerable distributed-algorithms effort, whose most tangible result is a much faster execution time for reference methods such as MASH and FSWM; (ii) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (iii) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
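As a purely illustrative aside, the kind of computation an AF function performs can be sketched on a single machine with k-mer statistics. The sketch below is not FADE's code or API, and it is not claimed to match any of the eighteen functions the platform ships with; the function names (`kmer_counts`, `euclidean_distance`, `jaccard_distance`) and the default `k=3` are assumptions chosen for illustration. It compares two sequences by their k-mer counts, with no alignment step.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen, so the union of keys suffices.
    return sqrt(sum((ca[x] - cb[x]) ** 2 for x in set(ca) | set(cb)))


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """One minus the Jaccard similarity of the two k-mer sets (exact, no sketching)."""
    sa, sb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = sa | sb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


if __name__ == "__main__":
    s1, s2 = "ACGTACGTGACG", "ACGTTCGTGACC"
    print(euclidean_distance(s1, s2))
    print(jaccard_distance(s1, s2))
```

On genome-scale inputs the k-mer counting step dominates the cost, which is presumably why the abstract frames the computation of AF functions as a Big Data problem suited to a distributed platform.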
Pages: 1658-1665
Number of pages: 8