Alignment-free Genomic Analysis via a Big Data Spark Platform

Cited by: 6
Authors
Petrillo, Umberto Ferraro [1]
Palini, Francesco [1]
Cattaneo, Giuseppe [2]
Giancarlo, Raffaele [3]
Affiliations
[1] Univ Roma La Sapienza, Dipartimento Sci Statist, I-00185 Rome, Italy
[2] Univ Salerno, Dipartimento Informat, I-84084 Fisciano, SA, Italy
[3] Univ Palermo, Dipartimento Matemat Informat, I-90133 Palermo, Italy
Keywords
STATISTICS;
DOI
10.1093/bioinformatics/btab014
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Because the applications are data-intensive, the computation of AF functions is a Big Data problem, and the recent literature indicates that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity.
Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. The development of FADE and its potential impact comprise several novel aspects of interest, namely: (i) a considerable effort in distributed algorithm design, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (ii) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (iii) its ability to support data- and compute-intensive tasks. On this last point, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
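For readers unfamiliar with AF functions, the following is a minimal, single-machine sketch of the general idea: two sequences are compared through their k-mer content rather than through an alignment. It is an illustration only, not FADE's implementation or API. The function names (`kmer_counts`, `jaccard_similarity`, `euclidean_distance`) are hypothetical, and the two measures shown are generic textbook examples, not necessarily among the eighteen functions the abstract mentions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer of the (uppercased) sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_similarity(a: str, b: str, k: int) -> float:
    """Jaccard index of the two k-mer sets (presence/absence only)."""
    ka, kb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = ka | kb
    # Two sequences with no k-mers at all are treated as identical.
    return len(ka & kb) / len(union) if union else 1.0


def euclidean_distance(a: str, b: str, k: int) -> float:
    """Euclidean distance between the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for a k-mer that is missing from one sequence.
    return sqrt(sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys()))


if __name__ == "__main__":
    s1, s2 = "ACGTACGT", "ACGTTCGA"
    print(jaccard_similarity(s1, s2, 3))
    print(euclidean_distance(s1, s2, 3))
```

At genomic scale the k-mer counting step dominates the cost, which is why the abstract frames the computation of AF functions as a Big Data problem suited to a distributed platform such as Spark.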
Pages: 1658-1665
Number of pages: 8