Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

Cited by: 4

Authors:

Seiler, Enrico [1,2]

Mehringer, Svenja [1]

Darvish, Mitra [2]

Turc, Etienne [3]

Reinert, Knut [1]

Affiliations:

[1] Free Univ Berlin, Dept Math & Comp Sci, Berlin, Germany

[2] Max Planck Inst Mol Genet, Efficient Algorithms Omics Data, Berlin, Germany

[3] ENSTA, Paris, France

Source:

ISCIENCE | 2021, Volume 24, Issue 07

Keywords:

CLASSIFICATION; SEARCH;

DOI:

10.1016/j.isci.2021.102782

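Chinese Library Classification:

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];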

Subject classification codes:

07 ; 0710 ; 09 ;

Abstract:

We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.
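
The ideas named in the abstract (winnowing minimizers feeding an interleaved Bloom filter, followed by a per-bin hit threshold) can be illustrated with a minimal Python sketch. This is not Raptor's implementation. Every name below (`kmer_hash`, `minimizers`, `InterleavedBloomFilter`, `candidate_bins`) is hypothetical, the window size `w` is assumed to count consecutive k-mers, reverse complements are ignored, the IBF is uncompressed and unpartitioned, and a fixed count stands in for the probabilistic threshold.

```python
import hashlib


def kmer_hash(kmer: str, seed: int = 0) -> int:
    """Deterministic 64-bit hash of a k-mer; the seed selects a hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minimizers(seq: str, k: int, w: int) -> set:
    """In every window of w consecutive k-mers, keep the k-mer with the smallest hash."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    hashes = [kmer_hash(kmer) for kmer in kmers]
    span = min(w, len(kmers))  # sequences shorter than one window form a single window
    selected = set()
    for start in range(len(kmers) - span + 1):
        best = min(range(start, start + span), key=lambda i: hashes[i])
        selected.add(kmers[best])
    return selected


class InterleavedBloomFilter:
    """One Bloom filter per bin, stored interleaved.

    rows[p] is an integer bitmask whose bit b is set when position p
    is set in the Bloom filter of bin b.
    """

    def __init__(self, bins: int, bits_per_bin: int, num_hashes: int = 2):
        self.bins = bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer: str) -> list:
        return [kmer_hash(kmer, seed) % self.bits_per_bin
                for seed in range(self.num_hashes)]

    def insert(self, kmer: str, bin_index: int) -> None:
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_index

    def query(self, kmer: str) -> int:
        """Return a bitmask of the bins that may contain the k-mer."""
        mask = (1 << self.bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]  # a bin survives only if all its positions are set
        return mask


def candidate_bins(ibf: InterleavedBloomFilter, query: str,
                   k: int, w: int, threshold: int) -> list:
    """Count minimizer hits per bin and return the bins reaching the threshold."""
    counts = [0] * ibf.bins
    for kmer in minimizers(query, k, w):
        mask = ibf.query(kmer)
        for b in range(ibf.bins):
            if (mask >> b) & 1:
                counts[b] += 1
    return [b for b, count in enumerate(counts) if count >= threshold]
```

A single lookup per hash function answers the membership question for all bins at once, which is the point of interleaving. Because Bloom filters can yield false positives but no false negatives, the returned bins are candidates that a downstream tool would still need to verify.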

Pages: 19

8 items in total

[1] Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments
Darvish, Mitra
Seiler, Enrico
Mehringer, Svenja
Rahn, Rene
Reinert, Knut
BIOINFORMATICS, 2022, 38 (17) : 4100 - 4108
[2] A space-efficient algorithm for aligning large genomic sequences
Morgenstern, B
BIOINFORMATICS, 2000, 16 (10) : 948 - 949
[3] A unique-order interpolative code for fast querying and space-efficient indexing in information retrieval systems
Cheng, CS
Shann, JJJ
Chung, CP
ITCC 2004: INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: CODING AND COMPUTING, VOL 2, PROCEEDINGS, 2004, : 229 - 235
[4] Unique-order interpolative coding for fast querying and space-efficient indexing in information retrieval systems
Cheng, CS
Shann, JJJ
Chung, CP
INFORMATION PROCESSING & MANAGEMENT, 2006, 42 (02) : 407 - 428
[5] A fast and space-efficient boundary element method for computing electrostatic and hydration effects in large molecules
Tripos, Inc., 1699 S. Hanley Road, St. Louis, MO 63144, United States
Unknown
J. Comput. Chem., 7 (864-877):
[6] A fast and space-efficient boundary element method for computing electrostatic and hydration effects in large molecules
Zauhar, RJ
Varnek, A
JOURNAL OF COMPUTATIONAL CHEMISTRY, 1996, 17 (07) : 864 - 877
[7] Fast and space-efficient location of heavy or dense segments in run-length encoded sequences - (Extended abstract)
Greenberg, RI
COMPUTING AND COMBINATORICS, PROCEEDINGS, 2003, 2697 : 528 - 536
[8] Fast and space-efficient algorithms for deciding shellability of simplicial complexes of large size using h-assignments
Moriyama, S
Nagai, A
Imai, H
MATHEMATICAL SOFTWARE, PROCEEDINGS, 2002, : 82 - 92