Querying large read collections in main memory: a versatile data structure

Cited by: 9
Authors
Philippe, Nicolas [1, 2, 3]
Salson, Mikael [4, 5, 6]
Lecroq, Thierry [4]
Leonard, Martine [4]
Commes, Therese [3]
Rivals, Eric [1, 2]
Affiliations
[1] CNRS, UMR 5506, LIRMM, F-34095 Montpellier, France
[2] Univ Montpellier 2, F-34095 Montpellier, France
[3] CNRS, UMR 5237, CRBM, F-34293 Montpellier 5, France
[4] Univ Rouen, LITIS EA 4108, F-76821 Mont St Aignan, France
[5] Univ Lille 1, CNRS, LIFL, UMR 8022, F-59655 Villeneuve Dascq, France
[6] INRIA Lille Nord Europe, F-59655 Villeneuve Dascq, France
Source
BMC BIOINFORMATICS | 2011 / Vol. 12
Keywords
ERROR-CORRECTION; ALIGNMENT; GENOMES;
DOI
10.1186/1471-2105-12-242
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: High-throughput sequencing (HTS) is now heavily exploited for genome (re)sequencing, metagenomics, epigenomics, and transcriptomics, and requires different but computationally intensive bioinformatic analyses. When a reference genome is available, mapping reads to it is the first step of the analysis. Read mapping programs owe their efficiency to sophisticated genome indexing data structures, such as the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads using hash tables to further increase efficiency and accuracy. In various contexts (e.g., assembly or transcriptome analysis), read processing requires determining the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. Many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains largely unexplored. Yet the increase in sequencing throughput calls for new algorithmic solutions to query large read collections efficiently.
Results: Here, we present a solution, named Gk arrays, to index large collections of reads, together with an algorithm to build the structure and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts, and show that it processes queries quickly while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are suited to different types of read analysis (SNP detection, assembly, RNA-Seq).
Conclusions: Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible building block for designing innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under the CeCILL (GPL-compliant) license from http://www.atgc-montpellier.fr/ngs/.
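To make the kind of query described in the abstract concrete, here is a minimal sketch of a k-mer-to-reads index. It is not the Gk arrays structure or the library's API: it uses a plain Python dictionary (a hash-table approach, which uses far more memory than the paper's structure is reported to need), and the function names `build_kmer_index`, `reads_containing`, and `reads_containing_once` are hypothetical. It also assumes one reading of "(once/at least once)", namely "exactly once in the read" versus "at least once in the read".

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to {read_id: number of occurrences in that read}.

    Illustrative hash-table index only; NOT the Gk arrays layout.
    """
    index = defaultdict(lambda: defaultdict(int))
    for read_id, read in enumerate(reads):
        # Slide a window of length k over the read.
        for start in range(len(read) - k + 1):
            index[read[start:start + k]][read_id] += 1
    return index


def reads_containing(index, kmer):
    """Ids of reads that contain kmer at least once."""
    # .get() avoids creating an empty entry for unseen k-mers.
    return sorted(index.get(kmer, {}))


def reads_containing_once(index, kmer):
    """Ids of reads that contain kmer exactly once (assumed reading of 'once')."""
    return sorted(r for r, n in index.get(kmer, {}).items() if n == 1)


# Toy read collection, made up for illustration.
reads = ["ACGTACGT", "TTACGA", "GGGG"]
index = build_kmer_index(reads, 3)

print(reads_containing(index, "ACG"))       # [0, 1]
print(reads_containing_once(index, "ACG"))  # [1]
```

Read 0 contains "ACG" twice, so it is returned by the "at least once" query but not by the "exactly once" query. As in the abstract, the index is built once, kept in memory, and then queried repeatedly.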
Pages: 16