Querying large read collections in main memory: a versatile data structure

Cited by: 9

Authors:

Philippe, Nicolas [1,2,3]

Salson, Mikael [4,5,6]

Lecroq, Thierry [4]

Leonard, Martine [4]

Commes, Therese [3]

Rivals, Eric [1,2]

Affiliations:

[1] CNRS, UMR 5506, LIRMM, F-34095 Montpellier, France

[2] Univ Montpellier 2, F-34095 Montpellier, France

[3] CNRS, UMR 5237, CRBM, F-34293 Montpellier 5, France

[4] Univ Rouen, LITIS EA 4108, F-76821 Mont St Aignan, France

[5] Univ Lille 1, CNRS, LIFL, UMR 8022, F-59655 Villeneuve d'Ascq, France

[6] INRIA Lille Nord Europe, F-59655 Villeneuve d'Ascq, France

Source:

BMC BIOINFORMATICS | 2011 / Volume 12

Keywords:

ERROR-CORRECTION; ALIGNMENT; GENOMES;

DOI:

10.1186/1471-2105-12-242

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Background: High Throughput Sequencing (HTS) is now heavily exploited for genome (re-)sequencing, metagenomics, epigenomics, and transcriptomics, and requires different but computationally intensive bioinformatic analyses. When a reference genome is available, mapping reads to it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, such as the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads using hash tables to further increase efficiency and accuracy. In various contexts (e.g., assembly or transcriptome analysis), read processing requires determining the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. So far, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains largely unexplored. The increase in sequencing throughput, however, calls for new algorithmic solutions to query large read collections efficiently.

Results: Here, we present a solution, named Gk arrays, to index large collections of reads, together with an algorithm to build the structure and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts, and show that it processes queries fast while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).

Conclusions: Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible building block for designing innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under the CeCILL (GPL compliant) license from http://www.atgc-montpellier.fr/ngs/.
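
To make the query type quoted in the abstract concrete, here is a minimal Python sketch of "given a k-mer, get the reads containing this k-mer (once/at least once)". It uses a plain dictionary and is only an illustration of the query semantics: it is not the Gk arrays structure, it makes no attempt at the memory savings the abstract reports, and every name in it (`build_kmer_index`, `reads_containing`, `reads_containing_once`) is hypothetical rather than part of the Gk arrays library.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to a {read_id: occurrence_count} dictionary.

    Naive illustration only; this is NOT the Gk arrays layout.
    """
    index = defaultdict(lambda: defaultdict(int))
    for read_id, read in enumerate(reads):
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            index[read[i:i + k]][read_id] += 1
    return index


def reads_containing(index, kmer):
    """Ids of reads containing the k-mer at least once."""
    return sorted(index.get(kmer, {}))


def reads_containing_once(index, kmer):
    """Ids of reads containing the k-mer exactly once."""
    return sorted(r for r, count in index.get(kmer, {}).items() if count == 1)


# Toy collection of three reads (made-up sequences).
reads = ["ACGTACGT", "TTACGA", "GGGG"]
index = build_kmer_index(reads, 3)
```

With this toy collection, `reads_containing(index, "ACG")` returns the ids of the first two reads, while `reads_containing_once(index, "ACG")` returns only the second, because the first read contains `ACG` twice.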


Pages: 16

