Querying large read collections in main memory: a versatile data structure

Cited by: 9

Authors:

Philippe, Nicolas ^{[1,2,3]}

Salson, Mikael ^{[4,5,6]}

Lecroq, Thierry ^{[4]}

Leonard, Martine ^{[4]}

Commes, Therese ^{[3]}

Rivals, Eric ^{[1,2]}

Affiliations:

[1] CNRS, UMR 5506, LIRMM, F-34095 Montpellier, France

[2] Univ Montpellier 2, F-34095 Montpellier, France

[3] CNRS, UMR 5237, CRBM, F-34293 Montpellier 5, France

[4] Univ Rouen, LITIS EA 4108, F-76821 Mont St Aignan, France

[5] Univ Lille 1, CNRS, LIFL, UMR 8022, F-59655 Villeneuve d'Ascq, France

[6] INRIA Lille Nord Europe, F-59655 Villeneuve d'Ascq, France

Source:

BMC BIOINFORMATICS | 2011 / Vol. 12

Keywords:

ERROR-CORRECTION; ALIGNMENT; GENOMES;

DOI:

10.1186/1471-2105-12-242

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Background: High Throughput Sequencing (HTS) is now heavily exploited for genome (re-)sequencing, metagenomics, epigenomics, and transcriptomics, and requires different but computationally intensive bioinformatic analyses. When a reference genome is available, mapping reads to it is the first step of the analysis. Read mapping programs owe their efficiency to sophisticated genome indexing data structures, such as the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads using hash tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires determining the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. So far, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains largely unexplored. However, the increase in sequencing throughput calls for new algorithmic solutions to query large read collections efficiently.

Results: Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).

Conclusions: Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible building block for designing innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under the CeCILL (GPL-compliant) license from http://www.atgc-montpellier.fr/ngs/.
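
The kind of query quoted in the abstract ("given a k-mer, get the reads containing this k-mer (once/at least once)") can be illustrated with a minimal sketch. This is not the Gk arrays structure and not the API of the Gk arrays library: it is a naive hash-table index, closer to the hash-based approaches the abstract mentions, and every function name below is hypothetical.

```python
from collections import Counter, defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to the list of (read_id, position) pairs where it occurs.

    Naive illustration only: memory grows with the total number of k-mer
    occurrences, which is the cost a dedicated structure aims to reduce.
    """
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((read_id, pos))
    return index


def reads_containing(index, kmer):
    """Sorted ids of reads that contain the k-mer at least once."""
    return sorted({read_id for read_id, _ in index.get(kmer, [])})


def reads_containing_once(index, kmer):
    """Sorted ids of reads that contain the k-mer exactly once."""
    counts = Counter(read_id for read_id, _ in index.get(kmer, []))
    return sorted(read_id for read_id, n in counts.items() if n == 1)


# Toy collection of three reads (hypothetical data), indexed with k = 4.
reads = ["ACGTACGT", "TTACGTTT", "GGGGCCCC"]
index = build_kmer_index(reads, 4)
at_least_once = reads_containing(index, "ACGT")
exactly_once = reads_containing_once(index, "ACGT")
```

In this toy collection, read 0 contains ACGT twice and read 1 contains it once, so the "at least once" query returns both reads while the "once" query returns only read 1.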

Pages: 16
