Alignment-free Genomic Analysis via a Big Data Spark Platform

Cited by: 6
Authors
Petrillo, Umberto Ferraro [1]
Palini, Francesco [1]
Cattaneo, Giuseppe [2]
Giancarlo, Raffaele [3]
Affiliations
[1] Univ Roma La Sapienza, Dipartimento Sci Statist, I-00185 Rome, Italy
[2] Univ Salerno, Dipartimento Informat, I-84084 Fisciano, SA, Italy
[3] Univ Palermo, Dipartimento Matemat Informat, I-90133 Palermo, Italy
Keywords
STATISTICS;
DOI
10.1093/bioinformatics/btab014
Chinese Library Classification
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Because these applications are data-intensive, computing AF functions is a Big Data problem, and the recent literature indicates that developing fast and scalable algorithms for computing them is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity.

Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. The development of FADE and its potential impact comprise several novel aspects of interest, namely: (i) a considerable effort in distributed algorithms, the most tangible result being a much faster execution time for reference methods like MASH and FSWM; (ii) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (iii) its ability to support data- and compute-intensive tasks. Regarding the last point, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
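To make the notion of an AF function concrete, the following is a minimal sketch of two textbook k-mer based measures: a Euclidean distance between k-mer count vectors and a Jaccard similarity between k-mer sets. It is not FADE's API and not the implementation of MASH or FSWM; the function names (`kmer_counts`, `euclidean_distance`, `jaccard_similarity`) and the default `k` are assumptions chosen for illustration, and the code runs on a single machine rather than on Spark.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer count vectors of two sequences."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for missing keys, so absent k-mers contribute their full count.
    return sqrt(sum((ca[x] - cb[x]) ** 2 for x in set(ca) | set(cb)))


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity between the sets of distinct k-mers of two sequences."""
    sa, sb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = sa | sb
    # Two sequences with no k-mers at all are treated as identical here (an assumption).
    return len(sa & sb) / len(union) if union else 1.0
```

Neither function aligns the sequences: both only compare k-mer statistics, which is what lets this family of measures be computed independently per sequence and then combined, the property a distributed platform can exploit.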
Pages: 1658-1665
Number of pages: 8