TrieDedup: a fast trie-based deduplication algorithm to handle ambiguous bases in high-throughput sequencing

Cited by: 1

Authors:

Hu, Jianqiao [1,2]

Luo, Sai [1,3,4,5]

Tian, Ming [1,3]

Ye, Adam Yongxin [1,3,4]

Affiliations:

[1] Boston Childrens Hosp, Program Cellular & Mol Med, Boston, MA 02115 USA

[2] Univ Washington, Dept Biol, Seattle, WA USA

[3] Harvard Med Sch, Boston, MA 02115 USA

[4] Boston Childrens Hosp, Howard Hughes Med Inst, Boston, MA 02115 USA

[5] Tsinghua Univ, Sch Basic Med Sci, Beijing, Peoples R China

Source:

BMC BIOINFORMATICS | 2024 / Volume 25 / Issue 01

Funding:

National Institutes of Health (USA);

Keywords:

Deduplication; Ambiguous bases; Trie; Prefix tree; Next-generation sequencing; FORMAT;

DOI:

10.1186/s12859-024-05775-w

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Background: High-throughput sequencing is a powerful tool that is extensively applied in biological studies. However, sequencers may produce low-quality bases, leading to ambiguous bases, 'N's. PCR duplicates introduced in library preparation are conventionally removed in genomics studies, and several deduplication tools have been developed for this purpose. Two identical reads may appear different due to ambiguous bases, and the existing tools cannot address 'N's correctly or efficiently.

Results: Here we proposed and implemented TrieDedup, which uses the trie (prefix tree) data structure to compare and store sequences. TrieDedup can handle ambiguous base 'N's, and efficiently deduplicate at the level of raw sequences. We also reduced its memory usage by approximately 20% by implementing restrictedDict in Python. We benchmarked the performance of the algorithm and showed that TrieDedup can deduplicate reads up to 270-fold faster than pairwise comparison at a cost of 32-fold higher memory usage.

Conclusions: The TrieDedup algorithm may facilitate PCR deduplication, barcode or UMI assignment, and repertoire diversity analysis of large-scale high-throughput sequencing datasets with its ultra-fast algorithm that can account for ambiguous bases due to sequencing errors.
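
The idea in the abstract, storing reads in a trie and treating 'N' as compatible with any base during lookup, can be illustrated with a minimal Python sketch. This is not the published TrieDedup code: the function names (`find_match`, `insert`, `dedup`), the plain nested-dict trie, the `"$"` end marker, and the choice to process reads with fewer 'N's first are all assumptions made for illustration, and the memory-saving restrictedDict is not reproduced here.

```python
END = "$"  # hypothetical end-of-sequence marker; reads are assumed to use only A, C, G, T, N


def find_match(node, seq, i=0):
    """Return True if a stored read of the same length is compatible with seq.

    'N' on either side (query or stored read) is treated as matching any base.
    Recursive for brevity, so very long reads would need an iterative version.
    """
    if i == len(seq):
        return END in node
    base = seq[i]
    if base == "N":
        # An ambiguous query base may follow every stored branch.
        candidates = [k for k in node if k != END]
    else:
        # A concrete query base matches the same base or a stored 'N'.
        candidates = [k for k in (base, "N") if k in node]
    return any(find_match(node[k], seq, i + 1) for k in candidates)


def insert(node, seq):
    """Add seq to the trie rooted at node."""
    for base in seq:
        node = node.setdefault(base, {})
    node[END] = True


def dedup(reads):
    """Return the reads kept as unique, in processing order."""
    trie, unique = {}, []
    # Assumption: handle reads with fewer 'N's first, so the least
    # ambiguous copy of each duplicate group is the one that is kept.
    for read in sorted(reads, key=lambda s: s.count("N")):
        if not find_match(trie, read):
            insert(trie, read)
            unique.append(read)
    return unique


if __name__ == "__main__":
    print(dedup(["ACGT", "ACNT", "ACGA", "NCGA", "ACG"]))
```

In this sketch, `ACNT` and `NCGA` are dropped as duplicates of `ACGT` and `ACGA`, while `ACG` is kept because reads of different lengths are never treated as matches. Each lookup walks only the branches compatible with the query instead of comparing against every stored read, which is where the speed-up over pairwise comparison comes from; the trie itself is what costs the extra memory.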

Pages: 13
