Reference-free compression of next-generation sequencing data in FASTQ format

Cited by: 0
Authors
Tan, Li [1]
Sun, Jifeng [2]
Affiliations
[1] Guangzhou Maritime Inst, Sch Informat & Commun Engn, Guangzhou, Guangdong, Peoples R China
[2] South China Univ Technol, Sch Elect & Informat Engn, Guangzhou, Guangdong, Peoples R China
Keywords
NGS; DEMT model; DSRC; Lossless compression
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present a new reference-free, lossless approach to compressing next-generation sequencing (NGS) data in FASTQ format. The approach splits the input FASTQ data into three parts (metadata, short reads, and quality scores) and eliminates the redundancy in each part independently, according to its own characteristics. Experiments were conducted on five real-world NGS data sets. The results show that the proposed algorithm achieves a better compression gain than previous state-of-the-art compression algorithms.
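The three-way split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not describe the per-stream models, so `zlib` from the standard library stands in as a generic compressor for each stream. The function names (`split_fastq`, `compress_streams`, `decompress_streams`) are hypothetical, and the sketch assumes four-line records, a bare `+` separator line, ASCII content, and Unix line endings.

```python
import zlib
from typing import List, Tuple


def split_fastq(text: str) -> Tuple[List[str], List[str], List[str]]:
    """Split FASTQ text into (metadata, reads, qualities), one entry per record.

    Assumption: every record is exactly four lines and the third line is a bare "+".
    """
    lines = text.splitlines()
    if not lines or len(lines) % 4 != 0:
        raise ValueError("expected a non-empty input with four lines per record")
    metadata, reads, qualities = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or plus != "+":
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"read/quality length mismatch at line {i + 1}")
        metadata.append(header[1:])  # drop the leading "@"
        reads.append(seq)
        qualities.append(qual)
    return metadata, reads, qualities


def compress_streams(text: str) -> Tuple[bytes, bytes, bytes]:
    """Compress each of the three streams independently.

    zlib is a placeholder here; a real compressor would use a model
    tailored to each stream's statistics.
    """
    metadata, reads, qualities = split_fastq(text)
    return (
        zlib.compress("\n".join(metadata).encode("ascii"), 9),
        zlib.compress("\n".join(reads).encode("ascii"), 9),
        zlib.compress("\n".join(qualities).encode("ascii"), 9),
    )


def decompress_streams(blobs: Tuple[bytes, bytes, bytes]) -> str:
    """Rebuild the original FASTQ text from the three compressed streams."""
    metadata, reads, qualities = (
        zlib.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    records = [
        f"@{header}\n{seq}\n+\n{qual}\n"
        for header, seq, qual in zip(metadata, reads, qualities)
    ]
    return "".join(records)
```

Keeping the streams separate means each one could later be paired with a different model (for example, field-wise coding for metadata or a context model for quality scores) without affecting the other two. Under the stated assumptions, the round trip reproduces the input text exactly, which is the lossless property the abstract refers to.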
Pages: 10-13
Number of pages: 4