Optimization of Hadoop cluster for analyzing large-scale sequence data in bioinformatics

Authors
Toth, Adam [1]
Karimi, Ramin [1]
Affiliation
[1] University of Debrecen, Faculty of Informatics, Debrecen, Hungary
Keywords
Hadoop; optimization; next-generation sequencing; DNA signature; resource management; technologies
DOI
10.33039/ami.2019.01.002
Chinese Library Classification number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
The unexpected growth of high-throughput sequencing platforms in recent years has affected virtually all areas of modern biology. However, the ability to produce data continues to outpace the ability to analyze them, so continuous effort is needed to improve bioinformatics applications and make better use of these research opportunities. Because of the complexity and diversity of metagenomics data, their analysis has been a major challenge in bioinformatics. Sequence-based identification methods, such as those using DNA signatures (unique k-mers), are among the most recent popular methods for real-time analysis of raw sequencing data. DNA signature discovery is compute-intensive and time-consuming. Hadoop, a framework for parallel and distributed computing, is one of the popular platforms for analyzing large-scale data in bioinformatics. The main goals of this paper are to optimize running time and computational resource usage, such as CPU and memory consumption, and to manage the Hadoop cluster nodes.
Pages: 187-202
Number of pages: 16