Querying large read collections in main memory: a versatile data structure

Cited by: 9
Authors
Philippe, Nicolas [1, 2, 3]
Salson, Mikael [4, 5, 6]
Lecroq, Thierry [4]
Leonard, Martine [4]
Commes, Therese [3]
Rivals, Eric [1, 2]
Affiliations
[1] CNRS, UMR 5506, LIRMM, F-34095 Montpellier, France
[2] Univ Montpellier 2, F-34095 Montpellier, France
[3] CNRS, UMR 5237, CRBM, F-34293 Montpellier 5, France
[4] Univ Rouen, LITIS EA 4108, F-76821 Mont St Aignan, France
[5] Univ Lille 1, CNRS, LIFL, UMR 8022, F-59655 Villeneuve d'Ascq, France
[6] INRIA Lille Nord Europe, F-59655 Villeneuve d'Ascq, France
Source
BMC BIOINFORMATICS | 2011 / Volume 12
Keywords
ERROR-CORRECTION; ALIGNMENT; GENOMES;
DOI
10.1186/1471-2105-12-242
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: High Throughput Sequencing (HTS) is now heavily exploited for genome (re-)sequencing, metagenomics, epigenomics, and transcriptomics, and requires different but computationally intensive bioinformatic analyses. When a reference genome is available, mapping reads onto it is the first step of the analysis. Read mapping programs owe their efficiency to sophisticated genome indexing data structures, such as the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads, using hash tables, to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires determining the sub-collection of reads related to a given sequence, which is done by searching for some k-mers in the reads. Many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains largely unexplored. The increase in sequencing throughput, however, calls for new algorithmic solutions to query large read collections efficiently.
Results: We present a solution, named Gk arrays, to index large collections of reads, together with an algorithm to build the structure and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries such as "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure with other solutions that adapt uncompressed indexing structures designed for long texts, and show that it processes queries fast while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).
Conclusions: Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. They provide a flexible building block for designing innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under the CeCILL (GPL-compliant) license from http://www.atgc-montpellier.fr/ngs/.
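To make the query quoted in the abstract concrete, here is a minimal Python sketch of "given a k-mer, get the reads containing this k-mer (once/at least once)". It is not the Gk arrays structure and does not use the Gk arrays library: it is a naive hash-table index of the kind the abstract mentions as an existing approach, and it makes no attempt at the memory savings the paper reports. The function names (`build_kmer_index`, `reads_containing`, `reads_containing_once`) and the toy reads are hypothetical, and "once" is read here as "exactly once".

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to {read_id: number of occurrences in that read}.

    Naive hash-table index for illustration only (not the Gk arrays).
    """
    index = defaultdict(dict)
    for read_id, read in enumerate(reads):
        # Slide a window of length k over the read.
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            index[kmer][read_id] = index[kmer].get(read_id, 0) + 1
    return index


def reads_containing(index, kmer):
    """Ids of reads in which the k-mer occurs at least once."""
    return sorted(index.get(kmer, {}))


def reads_containing_once(index, kmer):
    """Ids of reads in which the k-mer occurs exactly once."""
    return sorted(r for r, n in index.get(kmer, {}).items() if n == 1)


# Hypothetical toy collection: "ACG" occurs twice in read 0 and once in read 1.
reads = ["ACGTACG", "TTACGTT", "GGGGGG"]
index = build_kmer_index(reads, k=3)
print(reads_containing(index, "ACG"))       # [0, 1]
print(reads_containing_once(index, "ACG"))  # [1]
```

The index is built once and then queried repeatedly, which mirrors the usage pattern described in the abstract; a dictionary of dictionaries like this one stores every k-mer string explicitly, which is the kind of memory cost a dedicated read index aims to avoid.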
Pages: 16