FQC: A novel approach for efficient compression, archival, and dissemination of fastq datasets

Cited by: 12
Authors
Dutta, Anirban [1]
Haque, Mohammed Monzoorul [1]
Bose, Tungadri [1]
Reddy, C. V. S. K. [1]
Mande, Sharmila S. [1]
Affiliations
[1] Tata Consultancy Serv Ltd, TCS Innovat Labs, Biosci R&D Div, 54-B Hadapsar Ind Estate, Pune 411013, Maharashtra, India
Keywords
Data compaction and compression; algorithms for biological data management; NGS data; sequencing data archival; lossless compression; reads
DOI
10.1142/S0219720015410036
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Sequence data repositories archive and disseminate fastq data in compressed format. Although GZIP compresses less efficiently than the available specialized fastq compression algorithms, data repositories continue to prefer it because of its ease of deployment, high processing speed, and portability. This study presents FQC, a fastq compression method that, in addition to providing significantly higher compression gains than GZIP, incorporates the features necessary for universal adoption by data repositories and end users. This study also proposes a novel archival strategy that allows sequence repositories to simultaneously store and disseminate lossless as well as (multiple) lossy variants of fastq files without requiring any additional storage. For academic users, Linux, Windows, and Mac implementations (both 32-bit and 64-bit) of FQC are freely available for download at: https://metagenomics.atc.tcs.com/compression/FQC.
Pages: 18