Optimization of Hadoop cluster for analyzing large-scale sequence data in bioinformatics

Authors
Toth, Adam [1]
Karimi, Ramin [1]
Affiliations
[1] Univ Debrecen, Fac Informat, Debrecen, Hungary
Keywords
Hadoop; optimization; next-generation sequencing; DNA signature; resource management; technologies
DOI
10.33039/ami.2019.01.002
Chinese Library Classification number
O1 [Mathematics]
Subject classification codes
0701; 070101
Abstract
The unexpectedly rapid growth of high-throughput sequencing platforms in recent years has affected virtually all areas of modern biology. However, the ability to produce data continues to outpace the ability to analyze it, so continuous effort is needed to improve bioinformatics applications and make better use of these research opportunities. Because of the complexity and diversity of metagenomics data, its analysis remains one of the major challenges in bioinformatics. Sequence-based identification methods, such as those using DNA signatures (unique k-mers), are among the most recent popular methods for real-time analysis of raw sequencing data. DNA signature discovery is compute-intensive and time-consuming. Hadoop, a framework for parallel and distributed computing, is one of the popular platforms for analyzing large-scale data in bioinformatics. The main goals of this paper are to optimize running time and computational resource usage, such as CPU consumption and memory usage, and to manage the Hadoop cluster nodes.
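To make the notion of a DNA signature (a k-mer unique to one genome) concrete, the following is a minimal in-memory sketch written in a map/reduce style. It is not the paper's implementation and does not use Hadoop itself; the function names `map_kmers`, `reduce_signatures`, and `find_signatures`, as well as the toy sequences, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(genome_id, sequence, k):
    """Map step: emit a (k-mer, genome_id) pair for every window of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], genome_id


def reduce_signatures(pairs):
    """Reduce step: group genome ids by k-mer, keep k-mers seen in exactly one genome."""
    owners = defaultdict(set)
    for kmer, genome_id in pairs:
        owners[kmer].add(genome_id)

    signatures = defaultdict(set)
    for kmer, ids in owners.items():
        if len(ids) == 1:  # the k-mer occurs in a single genome only
            signatures[next(iter(ids))].add(kmer)
    return dict(signatures)


def find_signatures(genomes, k):
    """Run the map step over all genomes, then reduce to per-genome unique k-mers."""
    pairs = chain.from_iterable(
        map_kmers(genome_id, seq, k) for genome_id, seq in genomes.items()
    )
    return reduce_signatures(pairs)


# Toy input (hypothetical sequences, not real data).
genomes = {"A": "ACGTT", "B": "CGTTA"}
signatures = find_signatures(genomes, 3)
print(signatures)
```

On a real cluster the map and reduce steps would run as distributed tasks over much larger inputs, which is where the running-time, CPU, and memory costs mentioned in the abstract arise.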
Pages: 187-202
Number of pages: 16