A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases

Cited by: 27
Authors
Jain, Chirag [1, 2]
Dilthey, Alexander [2]
Koren, Sergey [2]
Aluru, Srinivas [1]
Phillippy, Adam M. [2]
Affiliations
[1] Georgia Inst Technol, Sch Computat Sci & Engn, Atlanta, GA 30332 USA
[2] NHGRI, NIH, Bethesda, MD 20894 USA
Funding
U.S. National Science Foundation; U.S. National Institutes of Health
Keywords
Jaccard; long-read mapping; MinHash; minimizers; sketching; winnowing; GENOME; ALIGNMENT; TIME
DOI
10.1089/cmb.2018.0036
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than Burrows-Wheeler Aligner-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each 5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
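The abstract names two building blocks, minimizer sampling (winnowing) and MinHash-based identity estimation, without giving details. The following is a minimal Python sketch of those general techniques, not the authors' implementation. The function names, the BLAKE2b hash, the bottom-s sketch, and the Mash-style Jaccard-to-identity conversion are all illustrative assumptions that the abstract does not specify.

```python
import hashlib
import math
from typing import Set


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> Set[int]:
    """Winnowing: keep the smallest k-mer hash in each window of w consecutive k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) <= w:
        # Sequence too short for a full window: keep its single smallest hash.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def minhash_jaccard(a: Set[int], b: Set[int], s: int) -> float:
    """Bottom-s MinHash estimate of the Jaccard similarity of two hash sets."""
    sketch_a = set(sorted(a)[:s])
    sketch_b = set(sorted(b)[:s])
    # The s smallest hashes of the union form a sample of the union of a and b.
    union_sketch = sorted(sketch_a | sketch_b)[:s]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in sketch_a and h in sketch_b)
    return shared / len(union_sketch)


def identity_from_jaccard(j: float, k: int) -> float:
    """Mash-style conversion from Jaccard to identity (assumed error model)."""
    if j <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    read = "ACGTTGCATGCCGATAGCTAGCTAGGATCCGATCGATTACG"
    region = "ACGTTGCATGCCGATAGCTAGCTAGGATCCGTTCGATTACG"  # one substitution
    k, w, s = 8, 5, 16
    j = minhash_jaccard(minimizers(read, k, w), minimizers(region, k, w), s)
    print("estimated identity:", identity_from_jaccard(j, k))
```

In this sketch, sampling minimizers shrinks the set of k-mers that must be compared, and the Jaccard estimate over those samples stands in for a base-level alignment when scoring a candidate region.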
Pages: 766-779
Page count: 14