TrieDedup: a fast trie-based deduplication algorithm to handle ambiguous bases in high-throughput sequencing

Cited by: 1
Authors
Hu, Jianqiao [1 ,2 ]
Luo, Sai [1 ,3 ,4 ,5 ]
Tian, Ming [1 ,3 ]
Ye, Adam Yongxin [1 ,3 ,4 ]
Affiliations
[1] Boston Childrens Hosp, Program Cellular & Mol Med, Boston, MA 02115 USA
[2] Univ Washington, Dept Biol, Seattle, WA USA
[3] Harvard Med Sch, Boston, MA 02115 USA
[4] Boston Childrens Hosp, Howard Hughes Med Inst, Boston, MA 02115 USA
[5] Tsinghua Univ, Sch Basic Med Sci, Beijing, Peoples R China
Funding
National Institutes of Health (NIH), USA
Keywords
Deduplication; Ambiguous bases; Trie; Prefix tree; Next-generation sequencing; FORMAT;
DOI
10.1186/s12859-024-05775-w
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: High-throughput sequencing is a powerful tool that is extensively applied in biological studies. However, sequencers may produce low-quality bases, which are reported as ambiguous bases ('N's). PCR duplicates introduced during library preparation are conventionally removed in genomics studies, and several deduplication tools have been developed for this purpose. Two identical reads may appear different because of ambiguous bases, and the existing tools cannot handle 'N's correctly or efficiently.
Results: Here we proposed and implemented TrieDedup, which uses the trie (prefix tree) data structure to compare and store sequences. TrieDedup can handle ambiguous 'N' bases and efficiently deduplicate at the level of raw sequences. We also reduced its memory usage by approximately 20% by implementing restrictedDict in Python. We benchmarked the performance of the algorithm and showed that TrieDedup can deduplicate reads up to 270-fold faster than pairwise comparison, at a cost of 32-fold higher memory usage.
Conclusions: The TrieDedup algorithm may facilitate PCR deduplication, barcode or UMI assignment, and repertoire diversity analysis of large-scale high-throughput sequencing datasets with its ultra-fast algorithm that can account for ambiguous bases due to sequencing errors.
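The idea of trie-based deduplication with ambiguous bases can be illustrated with a minimal Python sketch. This is not the authors' TrieDedup code: the names `SketchTrie`, `has_match`, and `dedup_sketch` are hypothetical, and the sketch assumes that all reads use the alphabet A/C/G/T/N, that an 'N' on either side matches any base, that only reads of equal length can be duplicates, and that reads with fewer 'N's are processed first so they are the ones kept.

```python
END = "$"  # hypothetical end-of-read marker, not part of the read alphabet


class SketchTrie:
    """Illustrative prefix tree built from nested dicts (one dict per node)."""

    def __init__(self):
        self.root = {}

    def add(self, seq):
        """Store one read, creating child nodes as needed."""
        node = self.root
        for base in seq:
            node = node.setdefault(base, {})
        node[END] = True

    def has_match(self, seq):
        """Return True if a stored read equals seq when 'N' matches any base."""
        stack = [(self.root, 0)]
        while stack:
            node, i = stack.pop()
            if i == len(seq):
                if END in node:
                    return True
                continue
            base = seq[i]
            if base == "N":
                # An ambiguous query base may follow every stored branch.
                children = [child for key, child in node.items() if key != END]
            else:
                # A definite query base follows its own branch or a stored 'N'.
                children = [node[key] for key in (base, "N") if key in node]
            stack.extend((child, i + 1) for child in children)
        return False


def dedup_sketch(reads):
    """Keep one representative per group of reads that match up to 'N's."""
    trie = SketchTrie()
    unique = []
    # Assumption: process reads with fewer 'N's first (sorted() is stable).
    for read in sorted(reads, key=lambda r: r.count("N")):
        if not trie.has_match(read):
            trie.add(read)
            unique.append(read)
    return unique


print(dedup_sketch(["ACGT", "ACNT", "ACGA", "NCGT", "ACGT"]))
```

In this sketch a lookup walks the trie once per read instead of comparing the read against every stored read, which is the intuition behind the speed-up over pairwise comparison; each stored base costs a dictionary entry, which hints at why memory usage is higher.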
Pages: 13