TrieDedup: a fast trie-based deduplication algorithm to handle ambiguous bases in high-throughput sequencing

Authors
Hu, Jianqiao [1, 2]
Luo, Sai [1, 3, 4, 5]
Tian, Ming [1, 3]
Ye, Adam Yongxin [1, 3, 4]
Affiliations
[1] Boston Childrens Hosp, Program Cellular & Mol Med, Boston, MA 02115 USA
[2] Univ Washington, Dept Biol, Seattle, WA USA
[3] Harvard Med Sch, Boston, MA 02115 USA
[4] Boston Childrens Hosp, Howard Hughes Med Inst, Boston, MA 02115 USA
[5] Tsinghua Univ, Sch Basic Med Sci, Beijing, Peoples R China
Funding
National Institutes of Health (USA)
Keywords
Deduplication; Ambiguous bases; Trie; Prefix tree; Next-generation sequencing; FORMAT;
DOI
10.1186/s12859-024-05775-w
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: High-throughput sequencing is a powerful tool that is extensively applied in biological studies. However, sequencers may produce low-quality bases, which are reported as ambiguous bases, 'N's. PCR duplicates introduced in library preparation are conventionally removed in genomics studies, and several deduplication tools have been developed for this purpose. Two identical reads may appear different due to ambiguous bases, and the existing tools cannot address 'N's correctly or efficiently.

Results: Here we proposed and implemented TrieDedup, which uses the trie (prefix tree) data structure to compare and store sequences. TrieDedup can handle ambiguous base 'N's and efficiently deduplicate at the level of raw sequences. We also reduced its memory usage by approximately 20% by implementing restrictedDict in Python. We benchmarked the performance of the algorithm and showed that TrieDedup can deduplicate reads up to 270-fold faster than pairwise comparison, at a cost of 32-fold higher memory usage.

Conclusions: The TrieDedup algorithm may facilitate PCR deduplication, barcode or UMI assignment, and repertoire diversity analysis of large-scale high-throughput sequencing datasets with its ultra-fast algorithm that can account for ambiguous bases due to sequencing errors.
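
To make the idea in the Results concrete, the following is a minimal Python sketch of trie-based deduplication in which 'N' matches any base. It is not the authors' implementation: the names TrieNode, find_match, insert, and dedup are hypothetical, processing reads in order of increasing 'N' count is an assumption of this sketch, and the restrictedDict memory optimization is not reproduced. Reads of different lengths are treated as distinct.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of the prefix tree; children are keyed by base (A, C, G, T, N)."""
    __slots__ = ("children", "is_end")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True if a stored read ends at this node


def find_match(node: TrieNode, seq: str, i: int = 0) -> bool:
    """Return True if a stored read of the same length is compatible with seq.

    An 'N' in seq matches any stored base; a stored 'N' matches any base in seq.
    """
    if i == len(seq):
        return node.is_end
    base = seq[i]
    if base == "N":
        candidates = list(node.children.values())
    else:
        candidates = [node.children[k] for k in (base, "N") if k in node.children]
    return any(find_match(child, seq, i + 1) for child in candidates)


def insert(root: TrieNode, seq: str) -> None:
    """Store seq in the trie, creating nodes as needed."""
    node = root
    for base in seq:
        child = node.children.get(base)
        if child is None:
            child = TrieNode()
            node.children[base] = child
        node = child
    node.is_end = True


def dedup(reads: Iterable[str]) -> List[str]:
    """Keep one representative per group of mutually compatible reads.

    Assumption of this sketch: reads with fewer 'N's are processed first,
    so the kept representative is the least ambiguous one seen.
    """
    root = TrieNode()
    unique: List[str] = []
    for read in sorted(reads, key=lambda r: r.count("N")):
        if not find_match(root, read):
            insert(root, read)
            unique.append(read)
    return unique


if __name__ == "__main__":
    example = ["ACGT", "ACNT", "ACGT", "NCGT", "TTTT", "ACG"]
    print(dedup(example))  # ['ACGT', 'TTTT', 'ACG']
```

The search only descends into branches that remain compatible with the query, which is what lets a trie avoid comparing each read against every stored read. The recursive search is kept for brevity; reads longer than Python's default recursion limit would need an iterative version.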
Pages: 13