Chaining for accurate alignment of erroneous long reads to acyclic variation graphs

Cited by: 8
Authors
Ma, Jun [1]
Caceres, Manuel [1]
Salmela, Leena [1]
Makinen, Veli [1]
Tomescu, Alexandru I. [1]
Affiliations
[1] Univ Helsinki, Dept Comp Sci, Helsinki 00014, Finland
Funding
European Research Council; Academy of Finland
Keywords
ALGORITHMS;
DOI
10.1093/bioinformatics/btad460
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253-28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner finds candidate read occurrences by individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds.
Results: We present a new algorithm to co-linearly chain a set of seeds in a string-labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm in a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates of 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12-17% more reads, and 21-28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265-19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58-73.)] obtain an accuracy of <60% in this setting.
Availability and implementation: GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline are accessible from the same address.
Pages: 10