Finding maximal exact matches in graphs

Cited by: 0
Authors
Rizzo, Nicola [1 ]
Caceres, Manuel [1 ]
Makinen, Veli [1 ]
Affiliations
[1] Univ Helsinki, Dept Comp Sci, Pietari Kalmin katu 5,POB 68, Helsinki 00014, Finland
Funding
European Union Horizon 2020;
Keywords
Sequence to graph alignment; Bidirectional BWT; r-index; Suffix tree; Founder graphs; SEARCH; CONSTRUCTION; RETRIEVAL; SEQUENCE; TREE;
DOI
10.1186/s13015-024-00255-5
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Background: We study the problem of finding maximal exact matches (MEMs) between a query string $Q$ and a labeled graph $G$. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $\kappa$ ($\kappa$-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs.
Results: In this paper we show an $O(n \cdot L \cdot d^{L-1} + m + M_{\kappa,L})$-time algorithm finding all $\kappa$-MEMs between $Q$ and $G$ spanning exactly $L$ nodes in $G$, where $n$ is the total length of node labels, $d$ is the maximum degree of a node in $G$, $m = |Q|$, and $M_{\kappa,L}$ is the number of output MEMs. We use this algorithm to develop a $\kappa$-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $O(nH^2 + m + M_\kappa)$, where $H$ is the maximum number of nodes in a block, and $M_\kappa$ is the total number of $\kappa$-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between $G$ and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection.
Conclusions: We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems.
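As an illustration of the $\kappa$-MEM definition used in the abstract, the following is a minimal brute-force sketch for the simpler string-to-string case only. It is not the authors' graph algorithm and does not reflect the efg-mems code; the function name `kappa_mems` and its parameters are hypothetical, chosen for this example. A match is reported when it cannot be extended to the left or to the right and has length at least `kappa`.

```python
def kappa_mems(text, query, kappa):
    """Return all (text_start, query_start, length) maximal exact matches
    between two plain strings whose length is at least kappa.

    Illustrative brute force (assumption: string case only, no graph),
    not the algorithm described in the paper.
    """
    mems = []
    for i in range(len(text)):
        for j in range(len(query)):
            if text[i] != query[j]:
                continue
            # Left-maximality: skip if the match could be extended one
            # character to the left in both strings.
            if i > 0 and j > 0 and text[i - 1] == query[j - 1]:
                continue
            # Extend to the right while characters agree; stopping here
            # guarantees right-maximality.
            length = 0
            while (i + length < len(text)
                   and j + length < len(query)
                   and text[i + length] == query[j + length]):
                length += 1
            if length >= kappa:
                mems.append((i, j, length))
    return mems


if __name__ == "__main__":
    # With kappa = 3, only the long shared substring "TTACA" survives.
    print(kappa_mems("GATTACA", "TTACAG", 3))
```

Raising `kappa` filters out the many short matches, which is the effect the abstract describes as limiting the number of seeds passed to chaining.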
Pages: 17
Related papers
50 in total
  • [1] Finding maximal exact matches in graphs
    Nicola Rizzo
    Manuel Cáceres
    Veli Mäkinen
    Algorithms for Molecular Biology, 19
  • [2] Chaining of Maximal Exact Matches in Graphs
    Rizzo, Nicola
    Caceres, Manuel
    Makinen, Veli
    STRING PROCESSING AND INFORMATION RETRIEVAL, SPIRE 2023, 2023, 14240 : 353 - 366
  • [3] MONI: A Pangenomic Index for Finding Maximal Exact Matches
    Rossi, Massimiliano
    Oliva, Marco
    Langmead, Ben
    Gagie, Travis
    Boucher, Christina
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2022, 29 (02) : 169 - 187
  • [4] Finding Maximal Exact Matches Using the r-Index
    Rossi, Massimiliano
    Oliva, Marco
    Bonizzoni, Paola
    Langmead, Ben
    Gagie, Travis
    Boucher, Christina
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2022, 29 (02) : 188 - 194
  • [5] copMEM: finding maximal exact matches via sampling both genomes
    Grabowski, Szymon
    Bieniecki, Wojciech
    BIOINFORMATICS, 2019, 35 (04) : 677 - 678
  • [6] essaMEM: finding maximal exact matches using enhanced sparse suffix arrays
    Vyverman, Michael
    De Baets, Bernard
    Fack, Veerle
    Dawyndt, Peter
    BIOINFORMATICS, 2013, 29 (06) : 802 - 804
  • [7] Extracting Maximal Exact Matches on GPU
    Abu-Doleh, Anas
    Kaya, Kamer
    Abouelhoda, Mohamed
    Catalyurek, Umit V.
    PROCEEDINGS OF 2014 IEEE INTERNATIONAL PARALLEL & DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2014, : 1418 - 1427
  • [8] A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays
    Khan, Zia
    Bloom, Joshua S.
    Kruglyak, Leonid
    Singh, Mona
    BIOINFORMATICS, 2009, 25 (13) : 1609 - 1616
  • [9] Faster Maximal Exact Matches with Lazy LCP Evaluation
    Goga, Adrian
    Depuydt, Lore
    Brown, Nathaniel K.
    Fostier, Jan
    Gagie, Travis
    Navarro, Gonzalo
    2024 DATA COMPRESSION CONFERENCE, DCC, 2024, : 123 - 132
  • [10] Practical Distributed Computation of Maximal Exact Matches in the Cloud
    El-Din, Sondos Seif
    Abouelhoda, Mohamed
    2014 IEEE-EMBS INTERNATIONAL CONFERENCE ON BIOMEDICAL AND HEALTH INFORMATICS (BHI), 2014, : 609 - 613