Fast and memory efficient approach for mapping NGS reads to a reference genome

Cited by: 14
Authors
Kumar, Sanjeev [1]
Agarwal, Suneeta [1]
Ranvijay [1]
Affiliations
[1] NIT Allahabad, CSED, Allahabad 211004, Uttar Pradesh, India
Keywords
Indexing; read alignment; Burrows-Wheeler transform; wavelet tree; suffix array; genome; ALIGNMENT; ACCURATE;
DOI
10.1142/S0219720019500082
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Next-generation sequencing machines such as Illumina and Solexa can generate millions of short reads from a given genome sequence in a single run. Aligning these reads to a reference genome is a core step in next-generation sequencing (NGS) data analysis, for example in genetic variation studies and genome resequencing. There is therefore a need for a new approach that is efficient in both memory and time when aligning this enormous number of reads to the reference genome. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require a large amount of memory for whole-reference-genome indexing and read alignment. Gapped-alignment versions of these techniques are also 20-40% slower than their respective normal versions. This paper proposes WIT, an efficient approach to reference genome indexing and read alignment that uses the Burrows-Wheeler Transform (BWT) and the Wavelet Tree (WT). It supports both exact and approximate alignment. Experiments show that WIT performs best for protein sequence indexing. For indexing the reference genome, WIT requires 0.6 N space (N is the size of the reference genome), whereas the existing techniques BWA, Subread, Kart, and Minimap2 require between 1.25 N and 5 N. Experiments also show that, even with such a small index, the alignment time of the proposed approach is comparable to that of BWA, Subread, Kart, and Minimap2. The other alignment parameters, accuracy and confidentiality, are also shown experimentally to be better than those of Minimap2. The source code of WIT is available at http://www.algorithm-skg.com/wit/home.html.
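The abstract names the two building blocks (BWT and wavelet tree) without showing how they combine for exact read alignment. The following is a minimal Python sketch of the general technique, not the authors' WIT implementation: it builds the BWT from a naively sorted suffix array, answers rank queries with a pointer-based wavelet tree, and locates a read by FM-index backward search. All names (`bwt_and_sa`, `WaveletTree`, `build_index`, `locate`) are hypothetical, and the plain Python lists used here make no attempt to reach the 0.6 N index size reported in the abstract.

```python
def bwt_and_sa(text):
    """Return (BWT, suffix array) of text + '$' using a naive suffix sort."""
    t = text + "$"  # '$' sorts before A, C, G, T in ASCII
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = "".join(t[i - 1] for i in sa)  # i == 0 wraps around to '$'
    return bwt, sa


class WaveletTree:
    """Pointer-based wavelet tree supporting rank(c, i) over a sequence."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.leaf = len(self.alphabet) <= 1
        if self.leaf:
            return
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # prefix[i] = number of symbols in seq[:i] routed to the right child
        self.prefix = [0]
        for c in seq:
            self.prefix.append(self.prefix[-1] + (c not in self.left_syms))
        self.left = WaveletTree(
            [c for c in seq if c in self.left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in seq if c not in self.left_syms], self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c in the first i symbols."""
        if self.leaf:
            return i if self.alphabet and self.alphabet[0] == c else 0
        ones = self.prefix[i]
        if c in self.left_syms:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)


def build_index(reference):
    """Build (wavelet tree over BWT, C table, suffix array) for a reference."""
    bwt, sa = bwt_and_sa(reference)
    counts, total = {}, 0
    for c in sorted(set(bwt)):
        counts[c] = total  # number of symbols in bwt strictly smaller than c
        total += bwt.count(c)
    return WaveletTree(bwt), counts, sa


def locate(read, index):
    """Return sorted start positions of exact matches of read."""
    wt, counts, sa = index
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for c in reversed(read):  # backward search: extend the match leftwards
        if c not in counts:
            return []
        lo = counts[c] + wt.rank(c, lo)
        hi = counts[c] + wt.rank(c, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_index("ACGTACGTGACG")
print(locate("ACG", index))  # [0, 4, 9]
```

Each backward-search step costs two rank queries, so the search time depends on the read length and alphabet size rather than on the reference length. A space-efficient index would replace the Python lists with compact bit vectors and a sampled suffix array; approximate alignment would need additional logic on top of this exact search.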
Pages: 17
Related papers
50 in total
  • [1] An enrichment method for mapping ambiguous reads to the reference genome for NGS analysis
    Liu, Yuan
    Ma, Yongchao
    Salsman, Evan
    Manthey, Frank A.
    Elias, Elias M.
    Li, Xuehui
    Yan, Changhui
    [J]. JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2019, 17 (06)
  • [2] The efficient algorithm for mapping next generation sequencing reads to reference genome
    Pankiewicz, Patryk
    Kusmirek, Wiktor
    Nowak, Robert M.
    [J]. PHOTONICS APPLICATIONS IN ASTRONOMY, COMMUNICATIONS, INDUSTRY, AND HIGH-ENERGY PHYSICS EXPERIMENTS 2019, 2019, 11176
  • [3] A Fast and Efficient Algorithm for Mapping Short Sequences to a Reference Genome
    Antoniou, Pavlos
    Iliopoulos, Costas S.
    Mouchard, Laurent
    Pissis, Solon P.
    [J]. ADVANCES IN COMPUTATIONAL BIOLOGY, 2010, 680 : 399 - 403
  • [4] An approximate Bayesian approach for mapping paired-end DNA reads to a reference genome
    Shrestha, Anish Man Singh
    Frith, Martin C.
    [J]. BIOINFORMATICS, 2013, 29 (08) : 965 - 972
  • [5] G-MAPSEQ - A NEW METHOD FOR MAPPING READS TO A REFERENCE GENOME
    Wojciechowski, Pawel
    Frohmberg, Wojciech
    Kierzynka, Michal
    Zurkowski, Piotr
    Blazewicz, Jacek
    [J]. FOUNDATIONS OF COMPUTING AND DECISION SCIENCES, 2016, 41 (02) : 123 - 142
  • [6] Parallel and Memory-Efficient Reads Indexing for Genome Assembly
    Chapuis, Guillaume
    Chikhi, Rayan
    Lavenier, Dominique
    [J]. PARALLEL PROCESSING AND APPLIED MATHEMATICS, PT II, 2012, 7204 : 272 - 280
  • [7] A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases
    Jain, Chirag
    Dilthey, Alexander
    Koren, Sergey
    Aluru, Srinivas
    Phillippy, Adam M.
    [J]. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2017, 2017, 10229 : 66 - 81
  • [8] A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases
    Jain, Chirag
    Dilthey, Alexander
    Koren, Sergey
    Aluru, Srinivas
    Phillippy, Adam M.
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 2018, 25 (07) : 766 - 779
  • [9] Fast and memory-efficient mapping of short bisulfite sequencing reads using a two-letter alphabet
    Brandine, Guilherme de Sena
    Smith, Andrew D.
    [J]. NAR GENOMICS AND BIOINFORMATICS, 2021, 3 (04)
  • [10] GenomeScope: fast reference-free genome profiling from short reads
    Vurture, Gregory W.
    Sedlazeck, Fritz J.
    Nattestad, Maria
    Underwood, Charles J.
    Fang, Han
    Gurtowski, James
    Schatz, Michael C.
    [J]. BIOINFORMATICS, 2017, 33 (14) : 2202 - 2204