Fast and memory efficient approach for mapping NGS reads to a reference genome

Cited by: 14

Authors:

Kumar, Sanjeev [1]

Agarwal, Suneeta [1]

Ranvijay [1]

Affiliations:

[1] NIT Allahabad, CSED, Allahabad 211004, Uttar Pradesh, India

Source:

JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY | 2019 / Vol. 17 / Issue 02

Keywords:

Indexing; read alignment; Burrows-Wheeler transform; wavelet tree; suffix array; genome; ALIGNMENT; ACCURATE

DOI:

10.1142/S0219720019500082

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

New-generation sequencing machines such as Illumina and Solexa can generate millions of short reads from a given genome sequence in a single run. Aligning these reads to a reference genome is a core step in next-generation sequencing (NGS) data analysis, such as genetic variation analysis and genome resequencing. There is therefore a need for a new approach that is efficient in both memory and time when aligning this enormous number of reads to the reference genome. Existing techniques such as MAQ, Bowtie, BWA, BWBBLE, Subread, Kart, and Minimap2 require a large amount of memory for whole-reference-genome indexing and read alignment. The gapped-alignment versions of these techniques are also 20-40% slower than their respective normal versions. In this paper, an efficient approach, WIT, is proposed for reference genome indexing and read alignment using the Burrows-Wheeler Transform (BWT) and the Wavelet Tree (WT). It supports both exact and approximate alignment. Experiments show that WIT performs best in the case of protein sequence indexing. The index space required by WIT for the reference genome is 0.6 N (where N is the size of the reference genome), whereas the existing techniques BWA, Subread, Kart, and Minimap2 require between 1.25 N and 5 N. Experiments also show that, even with such a small index, the alignment time of the proposed approach is comparable to that of BWA, Subread, Kart, and Minimap2. The other alignment parameters, accuracy and confidentiality, are also experimentally shown to be better than those of Minimap2. The source code of WIT is available at http://www.algorithm-skg.com/wit/home.html.


Pages: 17

